Alerts doc rewrite #21333
base:master
8 issues found across 18 files
Prompt for AI agents (all 8 issues)
Understand the root cause of the following 8 issues and fix them.
<file name="docs/alerts/alert-configuration-syntax/variables-and-special-symbols.md">
<violation number="1" location="docs/alerts/alert-configuration-syntax/variables-and-special-symbols.md:5">The `:::tip` block that explains why status values are 1/3/4 is never closed before the following `:::note`, so the admonition markup is unbalanced and renders incorrectly.</violation>
<violation number="2" location="docs/alerts/alert-configuration-syntax/variables-and-special-symbols.md:127">`calc` converts `$mem_raw` into a boolean, so `$this` is never above 4M and the example alert can never fire.</violation>
</file>
<file name="docs/alerts/alert-configuration-syntax/calculations-and-transformations.md">
<violation number="1" location="docs/alerts/alert-configuration-syntax/calculations-and-transformations.md:90">The swap alert example labels the result as `MB/s`, but the stock alert actually reports a percentage of RAM swapped out, so the documented units are incorrect.</violation>
<violation number="2" location="docs/alerts/alert-configuration-syntax/calculations-and-transformations.md:91">The documented warning threshold (`$this > 200`) does not match the real stock alert, which warns once the percentage exceeds roughly 20–30; documenting the wrong cutoff misleads anyone copying the example.</violation>
<violation number="3" location="docs/alerts/alert-configuration-syntax/calculations-and-transformations.md:92">This example adds a critical threshold that the underlying stock alert does not define, so the doc is attributing behavior that doesn’t exist.</violation>
</file>
<file name="docs/alerts/alert-configuration-syntax/optional-metadata.md">
<violation number="1" location="docs/alerts/alert-configuration-syntax/optional-metadata.md:176">The workflow example uses `type: latency`, but `latency` is a `class` value; using it under `type` contradicts the taxonomy defined earlier in the document and misdirects readers.</violation>
<violation number="2" location="docs/alerts/alert-configuration-syntax/optional-metadata.md:193">Section 3.6.6 Step 2 recommends `type` values like `latency`/`error`, contradicting the earlier definition of `type` as a functional domain (System, Database, etc.), which will confuse alert authors.</violation>
</file>
<file name="docs/alerts/creating-alerts-pages/creating-and-editing-alerts-via-config-files.md">
<violation number="1" location="docs/alerts/creating-alerts-pages/creating-and-editing-alerts-via-config-files.md:166">Chart IDs are documented as never using underscores even though real Netdata chart IDs (for example `disk_space._run`) contain underscores; this misleads readers trying to locate the correct `chart` name.</violation>
</file>
| Alert expressions in Netdata can reference variables that represent metric values, alert state, time information, and chart metadata. Understanding which variables are available and how to discover them is essential for writing effective alerts. | ||
| :::tip |
cubic-dev-ai bot, Nov 21, 2025 • edited
The `:::tip` block that explains why status values are 1/3/4 is never closed before the following `:::note`, so the admonition markup is unbalanced and renders incorrectly.
Prompt for AI agents
Address the following comment on docs/alerts/alert-configuration-syntax/variables-and-special-symbols.md at line 5:
<comment>The `:::tip` block that explains why status values are 1/3/4 is never closed before the following `:::note`, so the admonition markup is unbalanced and renders incorrectly.</comment>
<file context>
@@ -0,0 +1,607 @@
+
+Alert expressions in Netdata can reference variables that represent metric values, alert state, time information, and chart metadata. Understanding which variables are available and how to discover them is essential for writing effective alerts.
+
+:::tip
+
+Refer to this section when you're writing `calc`, `warn`, or `crit` expressions and need to know which variables exist, debugging alerts that reference missing or incorrect variable names, building alerts that combine multiple dimensions or chart metadata, or using the `alarm_variables` API to explore what's available on a chart.
</file context>
| **Example:** | ||
| ```conf | ||
| on: system.mem | ||
| calc: $mem_raw > 5000000 # Raw collected value in KiB |
cubic-dev-ai bot, Nov 21, 2025 • edited
`calc` converts `$mem_raw` into a boolean, so `$this` is never above 4M and the example alert can never fire.
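A possible fix, sketched with the chart and dimension names from the original example (they are not verified here): keep the comparison out of `calc` so that `$this` holds the numeric value, and do the comparison in `warn`.

```conf
    on: system.mem
  calc: $mem_raw            # $this now holds the raw collected value (KiB), not 0 or 1
  warn: $this > 4000000     # the comparison happens here, against the numeric value
```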
Prompt for AI agents
Address the following comment on docs/alerts/alert-configuration-syntax/variables-and-special-symbols.md at line 127:
<comment>`calc` converts `$mem_raw` into a boolean, so `$this` is never above 4M and the example alert can never fire.</comment>
<file context>
@@ -0,0 +1,607 @@
+**Example:**
+```conf
+on: system.mem
+calc: $mem_raw > 5000000 # Raw collected value in KiB
+warn: $this > 4000000
+```
</file context>
| calc: $this / 1024 * 100 / ( $system.ram.used + $system.ram.cached + $system.ram.free ) | ||
| units: MB/s | ||
| warn: $this > 200 | ||
| crit: $this > 400 |
cubic-dev-ai bot, Nov 21, 2025 • edited
This example adds a critical threshold that the underlying stock alert does not define, so the doc is attributing behavior that doesn’t exist.
Prompt for AI agents
Address the following comment on docs/alerts/alert-configuration-syntax/calculations-and-transformations.md at line 92:
<comment>This example adds a critical threshold that the underlying stock alert does not define, so the doc is attributing behavior that doesn’t exist.</comment>
<file context>
@@ -0,0 +1,352 @@
+ calc: $this / 1024 * 100 / ( $system.ram.used + $system.ram.cached + $system.ram.free )
+ units: MB/s
+ warn: $this > 200
+ crit: $this > 400
+```
+
</file context>
| lookup: sum -30m unaligned absolute of out | ||
| calc: $this / 1024 * 100 / ( $system.ram.used + $system.ram.cached + $system.ram.free ) | ||
| units: MB/s | ||
| warn: $this > 200 |
cubic-dev-ai bot, Nov 21, 2025 • edited
The documented warning threshold ($this > 200) does not match the real stock alert, which warns once the percentage exceeds roughly 20–30; documenting the wrong cutoff misleads anyone copying the example.
Prompt for AI agents
Address the following comment on docs/alerts/alert-configuration-syntax/calculations-and-transformations.md at line 91:
<comment>The documented warning threshold (`$this > 200`) does not match the real stock alert, which warns once the percentage exceeds roughly 20–30; documenting the wrong cutoff misleads anyone copying the example.</comment>
<file context>
@@ -0,0 +1,352 @@
+lookup: sum -30m unaligned absolute of out
+ calc: $this / 1024 * 100 / ( $system.ram.used + $system.ram.cached + $system.ram.free )
+ units: MB/s
+ warn: $this > 200
+ crit: $this > 400
+```
</file context>
| # From swap.conf | ||
| lookup: sum -30m unaligned absolute of out | ||
| calc: $this / 1024 * 100 / ( $system.ram.used + $system.ram.cached + $system.ram.free ) | ||
| units: MB/s |
cubic-dev-ai bot, Nov 21, 2025 • edited
The swap alert example labels the result as `MB/s`, but the stock alert actually reports a percentage of RAM swapped out, so the documented units are incorrect.
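Taken together, the three comments on this example point toward something like the following. This is only a sketch: the `units` string and the hysteresis form of the `warn` expression are assumptions derived from the "percentage of RAM" and "roughly 20–30" remarks in the comments, and should be checked against the stock `swap.conf` before being copied into the docs.

```conf
# From swap.conf (sketch; verify against the stock file)
lookup: sum -30m unaligned absolute of out
  calc: $this / 1024 * 100 / ( $system.ram.used + $system.ram.cached + $system.ram.free )
 units: % of RAM                                        # a percentage, not MB/s
  warn: $this > (($status >= $WARNING) ? (20) : (30))   # raise above 30, clear below 20
# no crit line: per the review, the stock alert does not define one
```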
Prompt for AI agents
Address the following comment on docs/alerts/alert-configuration-syntax/calculations-and-transformations.md at line 90:
<comment>The swap alert example labels the result as `MB/s`, but the stock alert actually reports a percentage of RAM swapped out, so the documented units are incorrect.</comment>
<file context>
@@ -0,0 +1,352 @@
+# From swap.conf
+lookup: sum -30m unaligned absolute of out
+ calc: $this / 1024 * 100 / ( $system.ram.used + $system.ram.cached + $system.ram.free )
+ units: MB/s
+ warn: $this > 200
+ crit: $this > 400
</file context>
| 2. **Define recommended `type` values aligned with your SRE vocabulary** | ||
| Example: `latency`, `error`, `saturation`, `traffic`, `anomaly` |
cubic-dev-ai bot, Nov 21, 2025 • edited
Section 3.6.6 Step 2 recommends `type` values like `latency`/`error`, contradicting the earlier definition of `type` as a functional domain (System, Database, etc.), which will confuse alert authors.
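One way to keep the two fields apart, assuming the taxonomy the comments describe (`class` for the kind of problem, `type` for the functional domain). The `component` value is a hypothetical placeholder, not taken from the PR.

```conf
    class: Latency      # kind of problem (latency, errors, ...) goes here
     type: Database     # functional domain (System, Database, ...)
component: MySQL        # hypothetical value, for illustration only
```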
Prompt for AI agents
Address the following comment on docs/alerts/alert-configuration-syntax/optional-metadata.md at line 193:
<comment>Section 3.6.6 Step 2 recommends `type` values like `latency`/`error`, contradicting the earlier definition of `type` as a functional domain (System, Database, etc.), which will confuse alert authors.</comment>
<file context>
@@ -0,0 +1,246 @@
+
+2. **Define recommended `type` values aligned with your SRE vocabulary**
+ 
+ Example: `latency`, `error`, `saturation`, `traffic`, `anomaly`
+
+3. **Map `component` values to your services or teams**
</file context>
✅ Addressed in d30b43b
| - **Dashboard:** Hover over the date on a chart and check the tooltip | ||
| - **API:** `curl "http://localhost:19999/api/v1/charts" | jq '.charts[] | {id, context}'` | ||
| Both use dots (`.`), not underscores (`_`), in their names. |
cubic-dev-ai bot, Nov 21, 2025 • edited
Chart IDs are documented as never using underscores even though real Netdata chart IDs (for example `disk_space._run`) contain underscores; this misleads readers trying to locate the correct `chart` name.
Prompt for AI agents
Address the following comment on docs/alerts/creating-alerts-pages/creating-and-editing-alerts-via-config-files.md at line 166:
<comment>Chart IDs are documented as never using underscores even though real Netdata chart IDs (for example `disk_space._run`) contain underscores; this misleads readers trying to locate the correct `chart` name.</comment>
<file context>
@@ -0,0 +1,362 @@
+- **Dashboard:** Hover over the date on a chart and check the tooltip
+- **API:** `curl "http://localhost:19999/api/v1/charts" | jq '.charts[] | {id, context}'`
+
+Both use dots (`.`), not underscores (`_`), in their names.
+
+:::
</file context>
| Both use dots (`.`), not underscores (`_`), in their names. | |
| +Contexts use dots (`.`), while chart IDs may include dots or underscores (for example, `disk_space._run`). |
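To illustrate the distinction in the suggested wording, here is a sketch that assumes `alarm` binds to a single chart ID while `template` binds to a context. The alert names, the threshold, and the `every` interval are hypothetical; `disk_space._run` is the chart ID mentioned in the comment, and `disk.space` is assumed to be its context.

```conf
# Chart ID: may contain underscores as well as dots
 alarm: run_disk_usage_example
    on: disk_space._run
  calc: $used * 100 / ($avail + $used)
 units: %
 every: 1m
  warn: $this > 80

# Context: dot-separated, applies to every chart of that context
 template: disk_usage_example
       on: disk.space
     calc: $used * 100 / ($avail + $used)
    units: %
    every: 1m
     warn: $this > 80
```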
This comment was marked as resolved.
ralphm left a comment
This is my review of "chapter 1".
General comments:
- I'd prefer if we could focus more on the dashboard or UI, rather than specifically calling out local and Cloud in multiple places, unless there is a difference. E.g. on the local dashboard the Events tab shows the alert history, too, so for all intents and purposes alert transitions can be seen in the Events tab.
- In many places we refer to Agent/Parent, as if those are separate things. Can we just talk about Agents in general, and only point out Parents specifically when needed?
- In (un)ordered lists, each item should generally end in a full stop when they are full sentences.
| | Core Concept| Description| | ||
| |--------------|-------------| | ||
| |**Where alerts are evaluated**| Locally on each Netdata Agent or Parent node, not centrally in Netdata Cloud| |
I'd phrase this as "Locally on an Agent, [..]". The description below explains in more detail.
| | Core Concept| Description| | ||
| |--------------|-------------| | ||
| |**Where alerts are evaluated**| Locally on each Netdata Agent or Parent node, not centrally in Netdata Cloud| | ||
| |**How alerts work**| Each alert inspects recent metric data (for example, "average CPU over the last 5 minutes") and decides whether the system is in a healthy state (`CLEAR`), needs attention (`WARNING`), or has a problem (`CRITICAL`)| |
I'm not sure if "recent" needs a mention here. I'd mention "locally stored metrics data". Note the plural in metrics, as "metric data" may be confused with data using the international metric system (SI).
| |--------------|-------------| | ||
| |**Where alerts are evaluated**| Locally on each Netdata Agent or Parent node, not centrally in Netdata Cloud| | ||
| |**How alerts work**| Each alert inspects recent metric data (for example, "average CPU over the last 5 minutes") and decides whether the system is in a healthy state (`CLEAR`), needs attention (`WARNING`), or has a problem (`CRITICAL`)| | ||
| |**What happens on status change**| When an alert's status changes, that transition becomes an**alert event** visible in the Agent dashboard, APIs, and Netdata Cloud's Events Feed| |
The transition event does not become visible in the dashboard or APIs. The status is. The events do become visible in the Events tab.
Also they may result in notifications (depending on the configuration of notification integrations). I think this must be mentioned.
I am not sure why this mentions the Agent dashboard, specifically. Using this phrasing, what about Cloud?
| ##Where Alerts Run | ||
| -**Agents** evaluate alerts against their own collected metrics | ||
| -**Parents** (in streaming setups) can evaluate alerts on metrics received from child nodes |
Parents also include Agents set up to collect metrics on behalf of a virtual node, so not just streaming setups.
| -**Agents** evaluate alerts against their own collected metrics | ||
| -**Parents** (in streaming setups) can evaluate alerts on metrics received from child nodes | ||
| -**Netdata Cloud** receives alert events from Agents/Parents and presents them in a unified view, but does**not** re-evaluate the rules itself |
I'd say alert events are received from Agents (without calling out Parents), but then we probably should mention that if multiple Agents represent a given node, their reported statuses are consolidated for that node.
| **Location:** Netdata Cloud backend + `/var/lib/netdata/config/` on agents | ||
| When you create or edit an alert using the **Alerts Configuration Manager** in Netdata Cloud: | ||
| - The alert definition is **stored in Netdata Cloud** as the source of truth |
This is not correct. I commented on this elsewhere.
| When you create or edit an alert using the **Alerts Configuration Manager** in Netdata Cloud: | ||
| - The alert definition is **stored in Netdata Cloud** as the source of truth | ||
| - Cloud **pushes the definition** to connected agents at runtime via the ACLK | ||
| - Agents **persist it to disk** at `/var/lib/netdata/config/*.dyncfg` for reliability |
We should either leave out this detail, or explicitly tell people that manually editing the files there is completely unsupported and the outcome undefined.
| **Key characteristics:** | ||
| - **Zero-touch rollout** Create an alert once in Cloud, it applies to all nodes in that space instantly |
I don't know what this means.
| | Alert Type | Best Used When | Common Use Cases | | ||
| |------------|----------------|------------------| | ||
| | **Cloud-defined alerts** | • You want centralized management across many nodes<br/>• You need instant rollout of new alerts or threshold changes<br/>• You prefer a UI workflow over editing config files<br/>• You want to leverage Cloud's deduplication and notification routing | Standard monitoring across your fleet | | ||
| | **File-based alerts** | • You need version control for alert definitions (Git, etc.)<br/>• You want full control over local configuration<br/>• You need custom syntax not yet supported in the Cloud UI<br/>• You prefer infrastructure-as-code workflows | Node-specific or advanced configurations | |
I think we should mention the configuration management systems aspect here, too (e.g. Ansible, Helm, etc.). Not just version control.
| 2. **Custom alerts** are loaded next from `/etc/netdata/health.d/` | ||
| - If a custom alert has the **same name** as a stock alert, the custom version **overrides** it | ||
| 3. **Cloud-defined alerts** are loaded at runtime and **coexist** with file-based alerts | ||
| - Cloud alerts use unique identifiers and typically don't conflict with file-based alert names |
I think "Cloud-defined alerts" is misleading. The only Cloud thing about it is that you use the Cloud UI to push the alert definition to one or more Agents. Also, people need to provide the name of an alert themselves. They are not magically unique. Alert definitions with the same name as a stock or custom (file-based) alert on the same Agent will override that definition.
ralphm commented Nov 26, 2025
Another observation: I see that the "main pages" for the chapters have the same name as the directory the subpages are in. Wouldn't it be better to move those into the directory as |
Ancairon commented Nov 27, 2025
@ralphm this is on purpose I think, as that's how you make a folder clickable and have something like an overview on Learn
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
hey, @ralphm, @ktsaou, @sashwathn, @shyamvalsan
Summary by cubic
Rewrote and expanded the Alerts docs with a practical guide and a complete syntax reference to make alert creation easier.
New Features
Refactors
Written for commit d30b43b. Summary will update automatically on new commits.