feat: implement observability of notifications subsystem #13799
Changes from 1 commit
Signed-off-by: Danny Kopping <danny@coder.com>
@@ -28,21 +28,26 @@ const (
func NewMetrics(reg prometheus.Registerer) *Metrics {
	return &Metrics{
		DispatchedCount: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
			Name: "dispatched_count", Namespace: ns, Subsystem: subsystem,
Review comment: accumulating counts should end in the units and _total, as in …
Reply: Wow, 4 months out of Grafana Labs and I forget all my Prometheus manners.
Help: "The count of notifications successfully dispatched.", | ||
}, []string{LabelMethod, LabelTemplateID}), | ||
TempFailureCount: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{ | ||
Name: "temporary_failures_count", Namespace: ns, Subsystem: subsystem, | ||
Help: "The count of notifications which failed but have retry attempts remaining.", | ||
}, []string{LabelMethod, LabelTemplateID}), | ||
PermFailureCount: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{ | ||
Name: "permanent_failures_count", Namespace: ns, Subsystem: subsystem, | ||
Help: "The count of notifications which failed and have exceeded their retry attempts.", | ||
}, []string{LabelMethod, LabelTemplateID}), | ||
Review comment: DispatchedCount, TempFailureCount, and PermFailureCount might make sense as a single metric with different "disposition" (or "result") labels. The help text refers to them as "notifications", but I think that's misleading. What we are counting is "notification attempts" or "dispatch attempts", which might individually end in success or some kind of failure. In particular, in the case of failure and retry, the number of delivery attempts can exceed the number of notifications, so it's important to make the distinction. Prometheus guidelines say that something should be a single metric with labels (vs. multiple metrics) when the sum or average across the label is sensible and coherent. Filtering by method and summing across disposition would be useful, for example, to track the number of web requests we are making via Webhook delivery.
Reply: Super idea! Simplifies things a lot.
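A minimal sketch of what that consolidation could look like, reusing identifiers already present in this file (ns, subsystem, LabelMethod, LabelTemplateID). The field name DispatchAttempts, the metric name dispatch_attempts_total (which also picks up the _total suffix requested in the earlier comment), and the result label and its values are illustrative assumptions, not necessarily what the PR ended up using:

		// Illustrative sketch: a single counter for all dispatch attempts,
		// partitioned by outcome, replacing the three separate counters above.
		// The "result" label would carry values such as "success",
		// "temp_failure", and "perm_failure".
		DispatchAttempts: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
			Name: "dispatch_attempts_total", Namespace: ns, Subsystem: subsystem,
			Help: "The number of notification dispatch attempts, by delivery method, template, and result.",
		}, []string{LabelMethod, LabelTemplateID, "result"}),

With this shape, summing across the result label while filtering on the webhook method gives the count of outbound web requests mentioned in the comment, and summing across results stays coherent because every increment is exactly one dispatch attempt.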
		RetryCount: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
			Name: "retry_count", Namespace: ns, Subsystem: subsystem,
			Help: "The count of notification dispatch retry attempts.",
		}, []string{LabelMethod, LabelTemplateID}),
		// Aggregating on LabelTemplateID as well would cause a cardinality explosion.
		QueuedSeconds: promauto.With(reg).NewHistogramVec(prometheus.HistogramOpts{
			Name: "queued_seconds", Namespace: ns, Subsystem: subsystem,
			Buckets: []float64{0.1, 1, 5, 15, 30, 60, 120, 300, 600, 3600, 86400},
			Help: "The time elapsed between a notification being enqueued in the store and retrieved for processing " +
				"(measures the latency of the notifications system). This should generally be within CODER_NOTIFICATIONS_FETCH_INTERVAL " +
@@ -51,13 +56,15 @@ func NewMetrics(reg prometheus.Registerer) *Metrics {
		}, []string{LabelMethod}),
		// Aggregating on LabelTemplateID as well would cause a cardinality explosion.
		DispatcherSendSeconds: promauto.With(reg).NewHistogramVec(prometheus.HistogramOpts{
dannykopping marked this conversation as resolved.
Name: "dispatcher_send_seconds", Namespace: ns, Subsystem: subsystem, | ||
Buckets: []float64{0.001, 0.05, 0.1, 0.5, 1, 2, 5, 10, 15, 30, 60, 120}, | ||
Review comment: should be …
Reply: I don't see much value in distinguishing between 1ms, 5ms, and 10ms; for most systems the latency will be higher than that (I'm assuming). The high end is what we're more concerned about here, since that's indicative of a problem. In other words: this metric is more for diagnosing issues than for measuring the performance of the endpoint/server.
Reply: Dispatching, say, a webhook within the same cluster could be a few milliseconds of latency. The same geographic region could be 10-20ms. Almost nothing is going to be 1ms unless it's extremely simple and extremely close, but in 50ms you could be dispatching across an entire continent. With buckets like this, you could have something that normally returns in 15ms take 3x its normal time and be close to falling over, and this metric would not clock it. That seems much more important than 30s, 60s, 120s.
Reply: I hear you and I can see where you're coming from. Let's wait for users to ask for this. I think there's sufficient coverage for now, and the additional datapoints are likely not going to be very useful for the majority of operators IMHO.
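Purely for illustration, and not the reviewer's actual suggestion (which is not preserved above): a layout that trades the 30s/60s/120s tail for finer mid-range resolution could use exponentially spaced buckets, e.g.:

			// Illustrative alternative: roughly 5ms to 10s, doubling each step
			// (0.005, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24 seconds).
			Buckets: prometheus.ExponentialBuckets(0.005, 2, 12),

With buckets like these, a dispatch that normally lands around 15ms and suddenly takes 3x as long shows up in a visibly higher bucket, which is the mid-range concern raised above.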
Help: "The time taken to dispatch notifications.", | ||
}, []string{LabelMethod}), | ||
// Currently no requirement to discriminate between success and failure updates which are pending. | ||
PendingUpdates: promauto.With(reg).NewGauge(prometheus.GaugeOpts{ | ||
Name: "pending_updates", Namespace: ns, Subsystem: subsystem, | ||
Help: "The number of updates waiting to be flushed to the store.", | ||
Review comment: Sounds too generic to me. "The number of delivery attempt results waiting to be flushed to the store"?
		}),
Review comment: A really useful gauge is the number of "in progress" delivery attempts. You increment it when you kick off the delivery goroutine and decrement it when you send to …
Reply: Good idea!
Reply: Also a count of success/failure updates flushed to the database.
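A sketch of the in-flight gauge pattern described in that comment, under assumed names (InflightDispatches and the exact increment/decrement points are illustrative, not taken from this PR):

		// Illustrative sketch: incremented when a delivery goroutine starts,
		// decremented once its result has been handed off, so the gauge
		// reflects dispatch attempts currently in progress.
		InflightDispatches: promauto.With(reg).NewGauge(prometheus.GaugeOpts{
			Name: "inflight_dispatches", Namespace: ns, Subsystem: subsystem,
			Help: "The number of dispatch attempts currently in progress.",
		}),

The caller would wrap each delivery attempt with Inc()/Dec() on this gauge, and the success/failure flush counts mentioned in the last reply could follow the same CounterVec pattern as the fields above.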
} | ||