Alerts doc rewrite #21333


Open: kanelatechnical wants to merge 24 commits into netdata:master from kanelatechnical:alerts-doc-rewrite

Conversation

@kanelatechnical (Contributor) commented Nov 21, 2025 (edited by cubic-dev-ai bot):

hey, @ralphm, @ktsaou, @sashwathn, @shyamvalsan

  1. all will be crosslinked, of course
  2. suggestion > comment
  3. maybe different reviewer for a different chapter so that this moves more efficiently and it's less scary to approach
  4. Git puts the files in alphabetical order, truly sorry

Summary by cubic

Rewrote and expanded the Alerts docs with a practical guide and a complete syntax reference to make alert creation easier.

  • New Features

    • New Alerts docs structure: Understanding, Creating and Managing, and Alert Configuration Syntax.
    • Added guides: quick start, Cloud UI flow, file-based flow, stock vs custom, and reload/validate.
    • Deep reference: definition lines, lookup/time windows, calc/transformations, expressions/operators/functions, variables, and metadata.
  • Refactors

    • Moved “Creating Alerts” into creating-alerts-pages/ for clearer IA.

Written for commit d30b43b. Summary will update automatically on new commits.

@github-actions (bot) added the area/docs and area/collectors labels on Nov 21, 2025.

@cubic-dev-ai (bot, Contributor) left a comment:

8 issues found across 18 files

Prompt for AI agents (all 8 issues)
Understand the root cause of the following 8 issues and fix them.

docs/alerts/alert-configuration-syntax/variables-and-special-symbols.md
  1. Line 5: The `:::tip` block that explains why status values are 1/3/4 is never closed before the following `:::note`, so the admonition markup is unbalanced and renders incorrectly.
  2. Line 127: `calc` converts `$mem_raw` into a boolean, so `$this` is never above 4M and the example alert can never fire.

docs/alerts/alert-configuration-syntax/calculations-and-transformations.md
  1. Line 90: The swap alert example labels the result as `MB/s`, but the stock alert actually reports a percentage of RAM swapped out, so the documented units are incorrect.
  2. Line 91: The documented warning threshold (`$this > 200`) does not match the real stock alert, which warns once the percentage exceeds roughly 20–30; documenting the wrong cutoff misleads anyone copying the example.
  3. Line 92: This example adds a critical threshold that the underlying stock alert does not define, so the doc is attributing behavior that doesn’t exist.

docs/alerts/alert-configuration-syntax/optional-metadata.md
  1. Line 176: The workflow example uses `type: latency`, but `latency` is a `class` value; using it under `type` contradicts the taxonomy defined earlier in the document and misdirects readers.
  2. Line 193: Section 3.6.6 Step 2 recommends `type` values like `latency`/`error`, contradicting the earlier definition of `type` as a functional domain (System, Database, etc.), which will confuse alert authors.

docs/alerts/creating-alerts-pages/creating-and-editing-alerts-via-config-files.md
  1. Line 166: Chart IDs are documented as never using underscores even though real Netdata chart IDs (for example `disk_space._run`) contain underscores; this misleads readers trying to locate the correct `chart` name.



Alert expressions in Netdata can reference variables that represent metric values, alert state, time information, and chart metadata. Understanding which variables are available and how to discover them is essential for writing effective alerts.

:::tip
@cubic-dev-ai (bot, Contributor) commented Nov 21, 2025 (edited):

The `:::tip` block that explains why status values are 1/3/4 is never closed before the following `:::note`, so the admonition markup is unbalanced and renders incorrectly.
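
To illustrate the kind of check that would catch this, here is a minimal Python sketch that flags an admonition opened while another is still open. It assumes Docusaurus-style markers, where `:::name` opens a block and a bare `:::` closes it. The function name and the sample strings are hypothetical and are not part of the PR.

```python
import re

OPEN = re.compile(r"^:::\w+")     # e.g. ":::tip", ":::note"
CLOSE = re.compile(r"^:::\s*$")   # a bare ":::" closes the current block


def find_unclosed_admonitions(markdown: str) -> list[tuple[int, str]]:
    """Return (line_number, opener) for each admonition that is still open
    when the next one starts or when the text ends."""
    problems = []
    current = None  # (line_number, opener) of the block that is open, if any
    for number, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        if OPEN.match(stripped):
            if current is not None:
                problems.append(current)  # previous block was never closed
            current = (number, stripped)
        elif CLOSE.match(stripped) and current is not None:
            current = None
    if current is not None:
        problems.append(current)
    return problems


# Hypothetical snippets: the first mirrors the reported problem.
broken = ":::tip\nStatus values are numeric.\n:::note\nSome note.\n:::\n"
fixed = ":::tip\nStatus values are numeric.\n:::\n\n:::note\nSome note.\n:::\n"

print(find_unclosed_admonitions(broken))  # [(1, ':::tip')]
print(find_unclosed_admonitions(fixed))   # []
```

This sketch deliberately ignores nested admonitions, which would need their own handling.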

**Example:**
```conf
on: system.mem
calc: $mem_raw > 5000000  # Raw collected value in KiB
warn: $this > 4000000
```
@cubic-dev-ai (bot, Contributor) commented Nov 21, 2025 (edited):

`calc` converts `$mem_raw` into a boolean, so `$this` is never above 4M and the example alert can never fire.
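
A rough model of why this example cannot fire, assuming (as the comment states) that a comparison inside `calc` evaluates to 0 or 1 and that `warn` is then tested against that result as `$this`. The `evaluate` helper is hypothetical and is not Netdata code; letting `calc` pass the value through is only one possible fix.

```python
def evaluate(calc, warn, mem_raw):
    """Model one evaluation: calc produces $this, then warn is tested on $this."""
    this = float(calc(mem_raw))
    return this, bool(warn(this))


# As documented: calc is a comparison, so $this can only be 0 or 1.
broken = evaluate(
    calc=lambda raw: raw > 5_000_000,
    warn=lambda this: this > 4_000_000,
    mem_raw=6_000_000,
)

# One possible fix: calc keeps the number, warn does the comparison.
fixed = evaluate(
    calc=lambda raw: raw,
    warn=lambda this: this > 4_000_000,
    mem_raw=6_000_000,
)

print(broken)  # (1.0, False)
print(fixed)   # (6000000.0, True)
```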

calc: $this / 1024 * 100 / ( $system.ram.used + $system.ram.cached + $system.ram.free )
units: MB/s
warn: $this > 200
crit: $this > 400
@cubic-dev-ai (bot, Contributor) commented Nov 21, 2025 (edited):

This example adds a critical threshold that the underlying stock alert does not define, so the doc is attributing behavior that doesn’t exist.

lookup: sum -30m unaligned absolute of out
calc: $this / 1024 * 100 / ( $system.ram.used + $system.ram.cached + $system.ram.free )
units: MB/s
warn: $this > 200
@cubic-dev-ai (bot, Contributor) commented Nov 21, 2025 (edited):

The documented warning threshold (`$this > 200`) does not match the real stock alert, which warns once the percentage exceeds roughly 20–30; documenting the wrong cutoff misleads anyone copying the example.

# From swap.conf
lookup: sum -30m unaligned absolute of out
calc: $this / 1024 * 100 / ( $system.ram.used + $system.ram.cached + $system.ram.free )
units: MB/s
@cubic-dev-ai (bot, Contributor) commented Nov 21, 2025 (edited):

The swap alert example labels the result as `MB/s`, but the stock alert actually reports a percentage of RAM swapped out, so the documented units are incorrect.
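
The three swap comments can be sanity-checked with a little arithmetic. The sketch below mirrors the `calc` line from the quoted example and shows that the result behaves like a percentage. It assumes the swap-out sum is in KiB and the RAM dimensions are in MiB; those units are not stated in this thread. The 20/30 hysteresis in `warns` is likewise an assumption drawn from the "roughly 20–30" wording, and all function names and input values are hypothetical.

```python
def ram_swapped_out_percent(swap_out_kib, used_mib, cached_mib, free_mib):
    """Mirror of: $this / 1024 * 100 / (used + cached + free)."""
    return swap_out_kib / 1024 * 100 / (used_mib + cached_mib + free_mib)


def warns(percent, currently_raised):
    """Assumed hysteresis: raise above 30, stay raised until back under 20."""
    threshold = 20 if currently_raised else 30
    return percent > threshold


# Hypothetical numbers: 2 GiB swapped out over the window, 8 GiB of RAM.
pct = ram_swapped_out_percent(
    swap_out_kib=2_097_152, used_mib=4096, cached_mib=2048, free_mib=2048
)

print(pct)                                 # 25.0
print(warns(pct, currently_raised=False))  # False
print(warns(pct, currently_raised=True))   # True
```

With inputs like these the value stays in percentage territory, which is why a `$this > 200` cutoff and an `MB/s` label do not fit the formula.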


2. **Define recommended `type` values aligned with your SRE vocabulary**

Example: `latency`, `error`, `saturation`, `traffic`, `anomaly`
@cubic-dev-ai (bot, Contributor) commented Nov 21, 2025 (edited):

Section 3.6.6 Step 2 recommends `type` values like `latency`/`error`, contradicting the earlier definition of `type` as a functional domain (System, Database, etc.), which will confuse alert authors.

✅ Addressed in d30b43b

- **Dashboard:** Hover over the date on a chart and check the tooltip
- **API:** `curl "http://localhost:19999/api/v1/charts" | jq '.charts[] | {id, context}'`

Both use dots (`.`), not underscores (`_`), in their names.
@cubic-dev-ai (bot, Contributor) commented Nov 21, 2025 (edited):

Chart IDs are documented as never using underscores even though real Netdata chart IDs (for example `disk_space._run`) contain underscores; this misleads readers trying to locate the correct `chart` name.

Suggested change
-Both use dots (`.`), not underscores (`_`), in their names.
+Contexts use dots (`.`), while chart IDs may include dots or underscores (for example, `disk_space._run`).
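
As a quick way to see the distinction, the sketch below extracts the same `id` and `context` fields that the `jq` filter in the quoted doc selects, then lists the chart IDs that contain underscores. The embedded payload is a hand-written sample shaped like the `charts` object that filter iterates over. It is not real Agent output, and the `disk.space` context value is an assumption made for illustration.

```python
import json

# Hand-written sample, not captured from a real Agent.
SAMPLE = json.loads("""
{"charts": {
  "system.cpu": {"id": "system.cpu", "context": "system.cpu"},
  "disk_space._run": {"id": "disk_space._run", "context": "disk.space"}
}}
""")


def ids_and_contexts(payload):
    """Equivalent in spirit to: jq '.charts[] | {id, context}'."""
    return [(chart["id"], chart["context"]) for chart in payload["charts"].values()]


def ids_with_underscores(payload):
    """Chart IDs that would contradict a 'dots only' rule."""
    return [chart_id for chart_id, _ in ids_and_contexts(payload) if "_" in chart_id]


print(ids_and_contexts(SAMPLE))
print(ids_with_underscores(SAMPLE))  # ['disk_space._run']
```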

@ilyam8

This comment was marked as resolved.

@github-actions (bot) removed the area/collectors label on Nov 21, 2025.

@ralphm (Member) left a comment:

This is my review of "chapter 1".

General comments:

  • I'd prefer if we could focus more on the dashboard or UI, rather than specifically calling out local and Cloud in multiple places, unless there is a difference. E.g. on the local dashboard the Events tab shows the alert history, too, so for all intents and purposes alert transitions can be seen in the Events tab.
  • In many places we refer to Agent/Parent, as if those are separate things. Can we just talk about Agents in general, and only point out Parents specifically when needed?
  • In (un)ordered lists, each item should generally end in a full stop when they are full sentences.


| Core Concept | Description |
|--------------|-------------|
| **Where alerts are evaluated** | Locally on each Netdata Agent or Parent node, not centrally in Netdata Cloud |

@ralphm (Member):

I'd phrase this as "Locally on an Agent, [..]". The description below explains in more detail.

| Core Concept | Description |
|--------------|-------------|
| **Where alerts are evaluated** | Locally on each Netdata Agent or Parent node, not centrally in Netdata Cloud |
| **How alerts work** | Each alert inspects recent metric data (for example, "average CPU over the last 5 minutes") and decides whether the system is in a healthy state (`CLEAR`), needs attention (`WARNING`), or has a problem (`CRITICAL`) |

@ralphm (Member):

I'm not sure if "recent" needs a mention here. I'd mention "locally stored metrics data". Note the plural in metrics, as "metric data" may be confused with data using the international metric system (SI).

|--------------|-------------|
| **Where alerts are evaluated** | Locally on each Netdata Agent or Parent node, not centrally in Netdata Cloud |
| **How alerts work** | Each alert inspects recent metric data (for example, "average CPU over the last 5 minutes") and decides whether the system is in a healthy state (`CLEAR`), needs attention (`WARNING`), or has a problem (`CRITICAL`) |
| **What happens on status change** | When an alert's status changes, that transition becomes an **alert event** visible in the Agent dashboard, APIs, and Netdata Cloud's Events Feed |

@ralphm (Member):

The transition event does not become visible in the dashboard or APIs. The status is. The events do become visible in the Events tab.

Also they may result in notifications (depending on the configuration of notification integrations). I think this must be mentioned.

I am not sure why this mentions the Agent dashboard, specifically. Using this phrasing, what about Cloud?

## Where Alerts Run

- **Agents** evaluate alerts against their own collected metrics
- **Parents** (in streaming setups) can evaluate alerts on metrics received from child nodes

@ralphm (Member):

Parents also include Agents set up to collect metrics on behalf of a virtual node, so not just streaming setups.


- **Agents** evaluate alerts against their own collected metrics
- **Parents** (in streaming setups) can evaluate alerts on metrics received from child nodes
- **Netdata Cloud** receives alert events from Agents/Parents and presents them in a unified view, but does **not** re-evaluate the rules itself

@ralphm (Member):

I'd say alert events are received from Agents (without calling out Parents), but then we probably should mention that if multiple Agents represent a given node, their reported statuses are consolidated for that node.

**Location:** Netdata Cloud backend + `/var/lib/netdata/config/` on agents

When you create or edit an alert using the **Alerts Configuration Manager** in Netdata Cloud:
- The alert definition is **stored in Netdata Cloud** as the source of truth

@ralphm (Member):

This is not correct. I commented on this elsewhere.

When you create or edit an alert using the **Alerts Configuration Manager** in Netdata Cloud:
- The alert definition is **stored in Netdata Cloud** as the source of truth
- Cloud **pushes the definition** to connected agents at runtime via the ACLK
- Agents **persist it to disk** at `/var/lib/netdata/config/*.dyncfg` for reliability

@ralphm (Member):

We should either leave out this detail, or explicitly tell people that manually editing the files there is completely unsupported and the outcome undefined.


**Key characteristics:**

- **Zero-touch rollout** Create an alert once in Cloud, it applies to all nodes in that space instantly

@ralphm (Member):

I don't know what this means.

| Alert Type | Best Used When | Common Use Cases |
|------------|----------------|------------------|
| **Cloud-defined alerts** | • You want centralized management across many nodes<br/>• You need instant rollout of new alerts or threshold changes<br/>• You prefer a UI workflow over editing config files<br/>• You want to leverage Cloud's deduplication and notification routing | Standard monitoring across your fleet |
| **File-based alerts** | • You need version control for alert definitions (Git, etc.)<br/>• You want full control over local configuration<br/>• You need custom syntax not yet supported in the Cloud UI<br/>• You prefer infrastructure-as-code workflows | Node-specific or advanced configurations |

@ralphm (Member):

I think we should mention the configuration management systems aspect here, too (e.g. Ansible, Helm, etc.). Not just version control.

2. **Custom alerts** are loaded next from `/etc/netdata/health.d/`
- If a custom alert has the **same name** as a stock alert, the custom version **overrides** it
3. **Cloud-defined alerts** are loaded at runtime and **coexist** with file-based alerts
- Cloud alerts use unique identifiers and typically don't conflict with file-based alert names

@ralphm (Member):

I think "Cloud-defined alerts" is misleading. The only Cloud thing about it is that you use the Cloud UI to push the alert definition to one or more Agents. Also, people need to provide the name of an alert themselves. They are not magically unique. Alert definitions with the same name as a stock or custom (file-based) alert on the same Agent will override that definition.
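
The override rule described in this comment can be pictured as a merge keyed by alert name. The sketch below is a simplified model built only from what this thread says (stock first, then custom files, then definitions pushed from the UI, with a same-name definition replacing the earlier one). It is not how the Agent is implemented, and every alert name and threshold in it is hypothetical.

```python
def effective_definitions(stock, custom, pushed):
    """Merge alert definitions by name; later sources replace earlier ones."""
    effective = {}
    for source, definitions in (("stock", stock), ("custom", custom), ("pushed", pushed)):
        for name, definition in definitions.items():
            effective[name] = (source, definition)
    return effective


# Hypothetical alert names and thresholds, for illustration only.
stock = {"disk_full": "warn: $this > 80", "cpu_high": "warn: $this > 75"}
custom = {"disk_full": "warn: $this > 90"}
pushed = {"cpu_high": "warn: $this > 85", "team_latency": "warn: $this > 250"}

merged = effective_definitions(stock, custom, pushed)
for name, (source, definition) in sorted(merged.items()):
    print(f"{name}: {definition}  (from {source})")
```

In this model names are not unique by construction: a pushed definition called `cpu_high` simply replaces the stock one, which is the point the comment makes.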

@ralphm (Member) commented:

Another observation: I see that the "main pages" for the chapters have the same name as the directory the subpages are in. Wouldn't it be better to move those into the directory as `index.md`?

@Ancairon (Member) commented:

@ralphm this is on purpose I think, as that's how you make a folder clickable and have something like an overview on Learn

kanelatechnical reacted with thumbs up emoji

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

Reviewers

  • @cubic-dev-ai[bot] left review comments
  • @ralphm requested changes
  • @Ancairon: awaiting requested review (code owner)
  • @ktsaou: awaiting requested review
  • @shyamvalsan: awaiting requested review
  • @sashwathn: awaiting requested review

Requested changes must be addressed to merge this pull request.

Assignees

No one assigned


4 participants

@kanelatechnical, @ilyam8, @ralphm, @Ancairon
