feat: measure pubsub latencies and expose metrics #13126
Conversation
Signed-off-by: Danny Kopping <danny@coder.com>
coderd/database/pubsub/pubsub.go Outdated
@@ -528,6 +555,19 @@ func (p *PGPubsub) Collect(metrics chan<- prometheus.Metric) {
	p.qMu.Unlock()
	metrics <- prometheus.MustNewConstMetric(currentSubscribersDesc, prometheus.GaugeValue, float64(subs))
	metrics <- prometheus.MustNewConstMetric(currentEventsDesc, prometheus.GaugeValue, float64(events))
	// additional metrics
	ctx, cancel := context.WithTimeout(context.Background(), time.Second*10)
I worry about running a measurement with a 10s timeout from within Collect(), which typically gets called every 15s or so.
I settled on that number because a) it's sufficiently high to indicate a serious problem, b) it won't disrupt the default 1m scrape interval (source), and c) even with a low interval like 15s it still will not interfere with scrapes.
We could lower this a bit, though; what do you think?
I guess, what's the expectation for how quickly Collect() returns?
// This method may be called concurrently and must therefore be
// implemented in a concurrency safe way. Blocking occurs at the expense
// of total performance of rendering all registered metrics. Ideally,
// Collector implementations support concurrent readers.
If NOTIFY is hobbled, then we're blocking Collect() for up to 10s, which doesn't seem like we're playing nice.
I think it's acceptable to have Collect() block while we measure the latency, provided we don't overrun a reasonable scrape interval. In most architectures it's rare to have multiple scrapers scraping the same targets, so Collect() should only ever be called once at a time.
Alternatively, we could spawn a background process which measures latency periodically and Collect() simply reads its state.
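For illustration, here is a minimal sketch of the blocking variant under discussion. The Pubsub interface shown is a stand-in and the measureLatency helper is a hypothetical name, not the PR's actual code:

```go
package pubsub

import (
	"context"
	"time"
)

// Pubsub is a minimal stand-in for the real pubsub interface.
type Pubsub interface {
	Subscribe(event string, listener func(ctx context.Context, message []byte)) (cancel func(), err error)
	Publish(event string, message []byte) error
}

// measureLatency (hypothetical) performs one blocking round trip: subscribe,
// publish a ping, and wait for the ping to arrive or for ctx to expire.
func measureLatency(ctx context.Context, ps Pubsub, channel string) (send, recv time.Duration, err error) {
	recvCh := make(chan time.Duration, 1)

	start := time.Now()
	cancel, err := ps.Subscribe(channel, func(_ context.Context, _ []byte) {
		select {
		case recvCh <- time.Since(start):
		default: // already reported for this measurement; drop.
		}
	})
	if err != nil {
		return 0, 0, err
	}
	defer cancel()

	if err := ps.Publish(channel, []byte("ping")); err != nil {
		return 0, 0, err
	}
	send = time.Since(start)

	select {
	case recv = <-recvCh:
		return send, recv, nil
	case <-ctx.Done():
		// NOTIFY is hobbled or slow; give up rather than blocking the scrape indefinitely.
		return send, 0, ctx.Err()
	}
}
```

Called from Collect() with the 10s context.WithTimeout shown in the diff, a scrape blocks for at most that timeout; the background alternative would run this measurement in a goroutine and have Collect() simply read the last recorded values.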
I like the idea of separating measurement in the background, but in practice you need to be careful you don't introduce goleak test flakes.
The only hesitation I have with background measurement is that scrapes will be showing stale data, which may give operators the wrong impression.
This architecture should be fine with multiple collections, as the unique ping will just get ignored by other calls to Collect().
OK, let's give it a try: a7c042f
I decided to double back to synchronous-only collection: 33a2a1d
The complexity that background collection introduced wasn't worth the benefit; I've satisfied my curiosity now 😅
coderd/database/pubsub/latency.go Outdated
)
cancel, err := p.Subscribe(latencyChannelName(), func(ctx context.Context, _ []byte) {
	res <- time.Since(start).Seconds()
having each measurement be an identical message (and ignoring it) could lead to previously timed-out measurements getting processed here. That could lead to very strange results, such as a ping arriving before start is set, which will result in the recv latency being two millennia 🙀
Good call! OK, I'll refactor to generate a unique message and only mark the latency roundtrip done when I observe that.
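For illustration, a minimal sketch of that unique-message approach, reusing the hypothetical Pubsub interface from the earlier sketch (the actual change landed in 65f57b1/49d2002 and may differ):

```go
// Extends the earlier sketch; additional imports:
import (
	"bytes"
	"context"
	"time"

	"github.com/google/uuid"
)

// uniquePingRoundTrip (hypothetical) publishes a unique payload and only
// completes when that exact payload comes back, so a stale ping left over
// from a previous, timed-out measurement is ignored instead of producing a
// nonsensical latency.
func uniquePingRoundTrip(ctx context.Context, ps Pubsub, channel string) (time.Duration, error) {
	ping := []byte(uuid.NewString())
	done := make(chan time.Duration, 1)

	start := time.Now()
	cancel, err := ps.Subscribe(channel, func(_ context.Context, msg []byte) {
		if !bytes.Equal(msg, ping) {
			return // ping from an earlier measurement; drop it.
		}
		select {
		case done <- time.Since(start):
		default:
		}
	})
	if err != nil {
		return 0, err
	}
	defer cancel()

	if err := ps.Publish(channel, ping); err != nil {
		return 0, err
	}

	select {
	case d := <-done:
		return d, nil
	case <-ctx.Done():
		return 0, ctx.Err()
	}
}
```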
Let me know what you think of 65f57b1, please.
49d2002 supersedes the above
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
// Force fast publisher to start after the slow one to avoid flakiness.
<-hold
time.Sleep(time.Millisecond * 50)
YOLO-ed this a bit because Go's scheduler gives a goroutine at most 20ms on CPU before it is pre-empted; I added a bit of extra time to be sure. I ran this test with -count=100 and didn't observe any flakes, but I'm open to better approaches.
This is probably OK for now, although the scheduler latency on Windows can apparently vary.
The approach I like to use for controlling time is to pass in a chan time.Time. Any time something needs the current time, it then comes from reading that channel.
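A minimal sketch of that pattern, with hypothetical names (this is the reviewer's general technique, not code from this PR): the code under test gets its notion of "now" from a channel, so the test decides exactly which timestamps are observed and when.

```go
package pubsub_test

import (
	"testing"
	"time"
)

// newChanClock returns a now() function backed by a channel of times; tests
// push timestamps onto the channel to control exactly what the code sees.
func newChanClock(times <-chan time.Time) func() time.Time {
	return func() time.Time { return <-times }
}

// elapsedSince measures a duration using the injected clock rather than the
// wall clock, so the test is immune to scheduler latency.
func elapsedSince(now func() time.Time) func() time.Duration {
	start := now()
	return func() time.Duration { return now().Sub(start) }
}

func TestControlledTime(t *testing.T) {
	t.Parallel()

	times := make(chan time.Time, 2)
	base := time.Date(2024, 5, 1, 0, 0, 0, 0, time.UTC)
	times <- base                             // observed when the measurement starts
	times <- base.Add(250 * time.Millisecond) // observed when it finishes

	stop := elapsedSince(newChanClock(times))
	if got := stop(); got != 250*time.Millisecond {
		t.Fatalf("expected 250ms, got %s", got)
	}
}
```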
This feels too brittle to me. Even if you ran it 100 times locally, we've observed really big variances in scheduling on CI.
I think you should simplify and fake the PubSub, not wrap it with a delay. Then you only need a single measurer, and you send it a bogus ping before the real one it publishes. An added bonus is that it will execute in microseconds, not a second.
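For illustration, a minimal sketch of such a fake (names are hypothetical; the PR's eventual test may look different). Publish delivers a bogus ping and then the real message synchronously to each listener, exercising the measurer's "ignore unexpected pings" path without sleeps or a real Postgres:

```go
package pubsub_test

import "context"

// fakePubsub is an in-memory stand-in: Subscribe records listeners and
// Publish invokes them synchronously on the calling goroutine. Not safe for
// concurrent use; it exists only to make the test deterministic.
type fakePubsub struct {
	listeners map[string][]func(ctx context.Context, message []byte)
}

func newFakePubsub() *fakePubsub {
	return &fakePubsub{listeners: map[string][]func(context.Context, []byte){}}
}

func (f *fakePubsub) Subscribe(event string, listener func(ctx context.Context, message []byte)) (func(), error) {
	f.listeners[event] = append(f.listeners[event], listener)
	return func() {}, nil // cancel is a no-op in this sketch.
}

// Publish first delivers a bogus ping to every listener and then the real
// message, so the measurer must discard the stale ping before completing.
func (f *fakePubsub) Publish(event string, message []byte) error {
	for _, l := range f.listeners[event] {
		l(context.Background(), []byte("bogus-ping")) // stale/foreign ping
		l(context.Background(), message)              // the real one
	}
	return nil
}
```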
Signed-off-by: Danny Kopping <danny@coder.com>
Refactor async measurement for immediate exit upon signal
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
We should also update docs/admin/prometheus.md with the added metrics. LGTM once that's done.
…orth the complexity
Signed-off-by: Danny Kopping <danny@coder.com>
Added #13223 to track this; seems like the metrics aren't detected automatically.
Minor suggestion inline, but I don't need to review again
coderd/database/pubsub/pubsub.go Outdated
@@ -528,6 +566,20 @@ func (p *PGPubsub) Collect(metrics chan<- prometheus.Metric) {
	p.qMu.Unlock()
	metrics <- prometheus.MustNewConstMetric(currentSubscribersDesc, prometheus.GaugeValue, float64(subs))
	metrics <- prometheus.MustNewConstMetric(currentEventsDesc, prometheus.GaugeValue, float64(events))
	// additional metrics
	ctx, cancel := context.WithTimeout(context.Background(), LatencyMeasureInterval)
Small quibble with the name: "interval" sounds to me like a periodic thing, so I would suggest LatencyMeasureTimeout.
Good call
Thanks for the patience on this PR!
Signed-off-by: Danny Kopping <danny@coder.com>
Pubsub is a crucial aspect of Coder's design. We use Postgres LISTEN/NOTIFY for this purpose, but the problem is that this subsystem of Postgres is not very observable. At most, one can observe how full its buffers are. This PR adds 3 metrics:

- coder_pubsub_send_latency_seconds
- coder_pubsub_receive_latency_seconds
- coder_pubsub_latency_measure_errs_total

These metrics are tracked on each coderd replica, and each replica will use its own notification channel.

Ideally we want to isolate Postgres' contribution to the latency as much as possible, but since there's a network in between this won't be very accurate; however, these metrics will still be Good Enough™️ to alert on and give operators an indication of why coderd appears to be slowing down.

Operators can use these metrics in conjunction with the rate of change in coder_pubsub_{publishes,subscribes}_total to infer whether the queue has become overly large and/or the receivers are too bogged down to alleviate pressure on the pubsub queue. If Postgres is under-resourced on CPU, this might also be a contributing factor.

TODO: add this as a healthcheck in a subsequent PR.
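For reference, a hedged sketch of how metrics with these names could be exposed from a custom Collector; the descriptor help strings and the emit helper are assumptions for this example, while the real PR wires its metrics into PGPubsub's existing Collect():

```go
package pubsub

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical descriptors matching the metric names listed above.
var (
	pubsubSendLatencyDesc = prometheus.NewDesc(
		"coder_pubsub_send_latency_seconds",
		"Time taken to publish a measurement ping to the pubsub.",
		nil, nil,
	)
	pubsubRecvLatencyDesc = prometheus.NewDesc(
		"coder_pubsub_receive_latency_seconds",
		"Time taken for a published measurement ping to be received.",
		nil, nil,
	)
	pubsubLatencyMeasureErrsDesc = prometheus.NewDesc(
		"coder_pubsub_latency_measure_errs_total",
		"Total number of failed latency measurements.",
		nil, nil,
	)
)

// emitLatencyMetrics sketches how a Collect implementation might report one
// measurement: gauges for the last observed latencies, a counter for errors.
func emitLatencyMetrics(metrics chan<- prometheus.Metric, send, recv time.Duration, errCount float64) {
	metrics <- prometheus.MustNewConstMetric(pubsubSendLatencyDesc, prometheus.GaugeValue, send.Seconds())
	metrics <- prometheus.MustNewConstMetric(pubsubRecvLatencyDesc, prometheus.GaugeValue, recv.Seconds())
	metrics <- prometheus.MustNewConstMetric(pubsubLatencyMeasureErrsDesc, prometheus.CounterValue, errCount)
}
```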