feat: implement observability of notifications subsystem #13799

Merged

dannykopping merged 27 commits into main from dk/system-notifications-o11y on Jul 11, 2024
Changes from 1 commit

Commits (27)
7f60c0f - Implement observability of notification subsystem (dannykopping, Jul 3, 2024)
d62d704 - make lint (dannykopping, Jul 5, 2024)
96dac65 - make gen (dannykopping, Jul 5, 2024)
130de49 - make fmt (dannykopping, Jul 8, 2024)
e868752 - Small fixes (dannykopping, Jul 8, 2024)
cee93cb - Review comments (dannykopping, Jul 8, 2024)
387b557 - Apply suggestions from code review (dannykopping, Jul 8, 2024)
114797d - Correcting query (dannykopping, Jul 8, 2024)
5ff29c0 - Merge branch 'main' of github.com:/coder/coder into dk/system-notific… (dannykopping, Jul 9, 2024)
88451a1 - Only return UUID from EnqueueNotificationMessage (dannykopping, Jul 9, 2024)
09f7305 - Review feedback (dannykopping, Jul 9, 2024)
91e2a23 - Minor fixups (dannykopping, Jul 9, 2024)
9f1d6b3 - Revert hack, no output param needed (dannykopping, Jul 10, 2024)
15c4537 - Small touch-ups (dannykopping, Jul 10, 2024)
53ecad4 - Merge branch 'main' of https://github.com/coder/coder into dk/system-… (dannykopping, Jul 10, 2024)
bc2a4cb - Merge branch 'main' of https://github.com/coder/coder into dk/system-… (dannykopping, Jul 10, 2024)
716e591 - Harden tests, fail early (dannykopping, Jul 10, 2024)
d408ed2 - make fmt (dannykopping, Jul 10, 2024)
2b9eec3 - Restoring deleted line (dannykopping, Jul 10, 2024)
4211c84 - Comments (dannykopping, Jul 10, 2024)
24417c5 - Lock before modification (dannykopping, Jul 10, 2024)
72bb1be - Remove TestNotifierPaused's unnecessarily fast fetch interval (dannykopping, Jul 10, 2024)
bfca2c1 - Merge branch 'main' of https://github.com/coder/coder into dk/system-… (dannykopping, Jul 10, 2024)
6602682 - Rename migration after numbering conflict (dannykopping, Jul 10, 2024)
00633a1 - Small fixes (dannykopping, Jul 11, 2024)
f454184 - Merge branch 'main' of https://github.com/coder/coder into dk/system-… (dannykopping, Jul 11, 2024)
84d07d4 - Logging improvement (dannykopping, Jul 11, 2024)

make fmt
Signed-off-by: Danny Kopping <danny@coder.com>
dannykopping committed Jul 8, 2024
commit 130de4952f6488e699b1f696e2f86a7f0260c6dd
21 changes: 14 additions & 7 deletions in coderd/notifications/metrics.go
@@ -28,21 +28,26 @@ const (

func NewMetrics(reg prometheus.Registerer) *Metrics {
return &Metrics{
-DispatchedCount: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{Name: "dispatched_count", Namespace: ns, Subsystem: subsystem,
+DispatchedCount: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
+	Name: "dispatched_count", Namespace: ns, Subsystem: subsystem,
Contributor:
Accumulating counts should end in the units and _total, as in coderd_notifications_successful_deliveries_total, where "deliveries" is the unit (could use "dispatch_attempts", or something similar).

https://prometheus.io/docs/practices/naming/

Contributor (author):
Wow, 4 months out of Grafana Labs and I forget all my Prometheus manners.
Thanks!
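
To make the guideline above concrete, here is a minimal sketch, assuming the coderd namespace and notifications subsystem referenced in this file; the function and label names are illustrative, not the code that was merged:

```go
package notifications

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical example of the naming convention: the unit ("deliveries")
// plus _total, yielding coderd_notifications_successful_deliveries_total.
func newSuccessfulDeliveriesCounter(reg prometheus.Registerer) *prometheus.CounterVec {
	return promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
		Namespace: "coderd",
		Subsystem: "notifications",
		Name:      "successful_deliveries_total",
		Help:      "The number of notifications successfully delivered.",
	}, []string{"method", "template_id"})
}
```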

Help: "The count of notifications successfully dispatched.",
}, []string{LabelMethod, LabelTemplateID}),
-TempFailureCount: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{Name: "temporary_failures_count", Namespace: ns, Subsystem: subsystem,
+TempFailureCount: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
+	Name: "temporary_failures_count", Namespace: ns, Subsystem: subsystem,
Help: "The count of notifications which failed but have retry attempts remaining.",
}, []string{LabelMethod, LabelTemplateID}),
-PermFailureCount: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{Name: "permanent_failures_count", Namespace: ns, Subsystem: subsystem,
+PermFailureCount: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
+	Name: "permanent_failures_count", Namespace: ns, Subsystem: subsystem,
Help: "The count of notifications which failed and have exceeded their retry attempts.",
}, []string{LabelMethod, LabelTemplateID}),
Contributor:
DispatchedCount, TempFailureCount, and PermFailureCount might make sense as a single metric with different "disposition" (or "result") labels. The help text refers to them as "notifications", but I think that's misleading. What we are counting is "notification attempts" or "dispatch attempts", which might individually end up in success or some kind of failure. In particular, in the case of failure and retry, the number of delivery attempts can exceed the number of notifications, so it's important to make the distinction.

Prometheus guidelines say that something should be a single metric with labels (vs multiple metrics) when the sum or average of the label is sensible and coherent. Filtering by method and summing across disposition would be useful, for example, to track the number of web requests we are making via Webhook delivery.

Contributor (author):
Super idea! Simplifies things a lot.
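
A rough sketch of that suggestion, assuming a "result" label with values such as "success", "temp_failure" and "perm_failure"; the variable, metric, and label value names are illustrative, not the code that was merged:

```go
package notifications

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical single counter replacing the separate success, temporary-failure
// and permanent-failure counters, partitioned by a "result" label.
var dispatchAttempts = promauto.With(prometheus.DefaultRegisterer).NewCounterVec(prometheus.CounterOpts{
	Namespace: "coderd",
	Subsystem: "notifications",
	Name:      "dispatch_attempts_total",
	Help:      "Notification dispatch attempts, partitioned by method, template and result.",
}, []string{"method", "template_id", "result"})

// Recording an outcome then just picks the result label value:
//
//	dispatchAttempts.WithLabelValues("webhook", "some-template-uuid", "success").Inc()
//	dispatchAttempts.WithLabelValues("smtp", "some-template-uuid", "temp_failure").Inc()
//
// Summing across "result" while filtering by "method" gives the total number
// of delivery attempts per method, as described above.
```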

-RetryCount: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{Name: "retry_count", Namespace: ns, Subsystem: subsystem,
+RetryCount: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
+	Name: "retry_count", Namespace: ns, Subsystem: subsystem,
Help: "The count of notification dispatch retry attempts.",
}, []string{LabelMethod, LabelTemplateID}),

// Aggregating on LabelTemplateID as well would cause a cardinality explosion.
-QueuedSeconds: promauto.With(reg).NewHistogramVec(prometheus.HistogramOpts{Name: "queued_seconds", Namespace: ns, Subsystem: subsystem,
+QueuedSeconds: promauto.With(reg).NewHistogramVec(prometheus.HistogramOpts{
+	Name: "queued_seconds", Namespace: ns, Subsystem: subsystem,
Buckets: []float64{0.1, 1, 5, 15, 30, 60, 120, 300, 600, 3600, 86400},
Help: "The time elapsed between a notification being enqueued in the store and retrieved for processing " +
"(measures the latency of the notifications system). This should generally be within CODER_NOTIFICATIONS_FETCH_INTERVAL " +
@@ -51,13 +56,15 @@ func NewMetrics(reg prometheus.Registerer) *Metrics {
}, []string{LabelMethod}),

// Aggregating on LabelTemplateID as well would cause a cardinality explosion.
-DispatcherSendSeconds: promauto.With(reg).NewHistogramVec(prometheus.HistogramOpts{Name: "dispatcher_send_seconds", Namespace: ns, Subsystem: subsystem,
+DispatcherSendSeconds: promauto.With(reg).NewHistogramVec(prometheus.HistogramOpts{
+	Name: "dispatcher_send_seconds", Namespace: ns, Subsystem: subsystem,
Buckets: []float64{0.001, 0.05, 0.1, 0.5, 1, 2, 5, 10, 15, 30, 60, 120},
Contributor:
should be 0.001, 0.005, 0.01, 0.05 on the bottom end, right?

Contributor (author):
I don't see much value in distinguishing between 1ms, 5ms, and 10ms; for most systems the latency will be higher than that (I'm assuming). The high end is what we're more concerned about here since that's indicative of a problem. In other words: this metric is more for diagnosing issues than measuring performance of the endpoint/server.

Contributor:
Dispatching, say, a webhook, within the same cluster could be a few milliseconds latency. Same geographic region could be 10-20ms.

Almost nothing is going to be 1ms unless it's extremely simple and extremely close, but in 50ms you could be dispatching across an entire continent. With buckets like this you could have something that normally returns in 15ms take 3x its normal time and be close to falling over and this metric would not clock it. That seems much more important than 30s, 60s, 120s.

Contributor (author):
I hear you and I can see where you're coming from.

Let's wait for users to ask for this. I think there's sufficient coverage for now, and the additional datapoints are likely not going to be very useful for the majority of operators IMHO.
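
For reference, a sketch of the finer low-end bucket layout proposed above, combined with the existing upper buckets; the PR kept the original values per the reply, so this is purely illustrative:

```go
// Illustrative only: the reviewer's proposed low-end buckets (1ms to 50ms)
// prepended to the existing dispatcher_send_seconds buckets from the PR.
var proposedSendSecondsBuckets = []float64{
	0.001, 0.005, 0.01, 0.05, // finer low end for fast, nearby targets
	0.1, 0.5, 1, 2, 5, 10, 15, 30, 60, 120, // existing buckets
}
```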

Help: "The time taken to dispatch notifications.",
}, []string{LabelMethod}),

// Currently no requirement to discriminate between success and failure updates which are pending.
-PendingUpdates: promauto.With(reg).NewGauge(prometheus.GaugeOpts{Name: "pending_updates", Namespace: ns, Subsystem: subsystem,
+PendingUpdates: promauto.With(reg).NewGauge(prometheus.GaugeOpts{
+	Name: "pending_updates", Namespace: ns, Subsystem: subsystem,
Help: "The number of updates waiting to be flushed to the store.",
Contributor:
Sounds too generic to me. "The number of delivery attempt results waiting to be flushed to the store"?

dannykopping reacted with thumbs up emoji
}),
Contributor:
A really useful Gauge is the number of "in progress" delivery attempts. You increment it when you kick off the delivery goroutine and decrement it when you send to the success or failure channels. Can help spot hung handlers/leaked goroutines.

dannykopping reacted with thumbs up emoji
Contributor (author):
Good idea!
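
A minimal sketch of that gauge, assuming the delivery outcome is reported over success/failure channels as described; the names here are assumptions, not the merged implementation:

```go
package notifications

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical "in progress" gauge: incremented when a delivery goroutine
// starts and decremented once its outcome has been reported.
var inflightDispatches = promauto.With(prometheus.DefaultRegisterer).NewGauge(prometheus.GaugeOpts{
	Namespace: "coderd",
	Subsystem: "notifications",
	Name:      "inflight_dispatches",
	Help:      "The number of delivery attempts currently in progress.",
})

// dispatch wraps a delivery attempt so the gauge tracks in-flight work;
// a value that only ever climbs points at hung handlers or leaked goroutines.
func dispatch(deliver func() error, success chan<- struct{}, failure chan<- error) {
	inflightDispatches.Inc()
	go func() {
		defer inflightDispatches.Dec()
		if err := deliver(); err != nil {
			failure <- err
			return
		}
		success <- struct{}{}
	}()
}
```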

Contributor:
Also a count of success/failure updates flushed to the database.

dannykopping reacted with thumbs up emoji
}
