feat: measure pubsub latencies and expose metrics #13126


Merged

dannykopping merged 17 commits into coder:main from dannykopping:dk/pubsub-latency on May 10, 2024

Conversation

dannykopping
Contributor

Pubsub is a crucial aspect of Coder's design. We use Postgres LISTEN/NOTIFY for this purpose, but the problem is that this subsystem of Postgres is not very observable. At most, one can observe how full its buffers are.

This PR adds 3 metrics:

  • coder_pubsub_send_latency_seconds
  • coder_pubsub_receive_latency_seconds
  • coder_pubsub_latency_measure_errs_total

These metrics are tracked on each coderd replica, and each replica will use its own notification channel.

Ideally we want to isolate Postgres' contribution to the latency as much as possible, but since there is a network in between, the measurement won't be very accurate; even so, these metrics will still be Good Enough™️ to alert on and to give operators an indication of why coderd appears to be slowing down.

Operators can use these metrics in conjunction with the rate of change in coder_pubsub_{publishes,subscribes}_total to infer whether the queue has become overly large and/or the receivers too bogged down to alleviate pressure on the pubsub queue. If Postgres is under-resourced on CPU, that may also be a contributing factor.

TODO: add this as a healthcheck in a subsequent PR.
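
For illustration, here is a rough sketch of the round-trip measurement described above, assuming a minimal Pubsub interface (the actual PGPubsub in this PR has more machinery around it; names like measureLatency are illustrative):

```go
package pubsub

import (
	"context"
	"time"
)

// Pubsub is a minimal interface assumed for this sketch; the real PGPubsub
// in this PR implements Publish/Subscribe over Postgres LISTEN/NOTIFY.
type Pubsub interface {
	Publish(channel string, message []byte) error
	Subscribe(channel string, listener func(ctx context.Context, message []byte)) (cancel func(), err error)
}

// measureLatency publishes a ping on a replica-private channel and times how
// long the notification takes to come back: a send latency (time to publish)
// and a receive latency (time until the listener fires).
func measureLatency(ctx context.Context, p Pubsub, channel string) (send, recv float64, err error) {
	res := make(chan float64, 1)

	start := time.Now()
	cancel, err := p.Subscribe(channel, func(_ context.Context, _ []byte) {
		select {
		case res <- time.Since(start).Seconds():
		default: // a measurement was already recorded
		}
	})
	if err != nil {
		return 0, 0, err
	}
	defer cancel()

	if err := p.Publish(channel, []byte("ping")); err != nil {
		return 0, 0, err
	}
	send = time.Since(start).Seconds()

	select {
	case recv = <-res:
		return send, recv, nil
	case <-ctx.Done():
		return send, 0, ctx.Err()
	}
}
```

In this sketch the send latency would feed coder_pubsub_send_latency_seconds, the receive latency coder_pubsub_receive_latency_seconds, and the error/timeout path coder_pubsub_latency_measure_errs_total.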

@dannykopping changed the title from "feat(database): measure pubsub latency and expose as metrics" to "feat: measure pubsub latency and expose as metrics" on May 2, 2024
@dannykopping changed the title from "feat: measure pubsub latency and expose as metrics" to "feat: measure pubsub latencies and expose metrics" on May 2, 2024
@@ -528,6 +555,19 @@ func (p *PGPubsub) Collect(metrics chan<- prometheus.Metric) {
p.qMu.Unlock()
metrics <- prometheus.MustNewConstMetric(currentSubscribersDesc, prometheus.GaugeValue, float64(subs))
metrics <- prometheus.MustNewConstMetric(currentEventsDesc, prometheus.GaugeValue, float64(events))

// additional metrics
ctx, cancel := context.WithTimeout(context.Background(), time.Second*10)
Contributor

I worry about running a measurement with a 10s timeout from within Collect(), which typically gets called every 15s or so.

Contributor (Author)

I settled on that number because a) it's sufficiently high as to indicate a serious problem, b) it won't disrupt the default 1m scrape interval (source), and c) even with a low interval like 15s it still will not interfere with scrapes.

We could lower this a bit though, what do you think?

Contributor

I guess, what's the expectation for how quickly Collect() returns?

// This method may be called concurrently and must therefore be
// implemented in a concurrency safe way. Blocking occurs at the expense
// of total performance of rendering all registered metrics. Ideally,
// Collector implementations support concurrent readers.

If NOTIFY is hobbled, then we're blocking Collect() for up to 10s, which doesn't seem like we're playing nice.

Contributor (Author)

I think it's acceptable to have Collect() block while we measure the latency, provided we don't overrun a reasonable scrape interval. In most architectures it's rare to have multiple scrapers scraping the same targets, so Collect() should only ever be called once at a time.

Alternatively, we could spawn a background process which measures latency periodically, and Collect() would simply read its state.
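
For context, the background alternative might look roughly like this (extending the hypothetical Pubsub/measureLatency sketch above and adding a sync import; it is not what the PR ultimately shipped, as the thread below explains):

```go
// latencyMeasurer runs measurements on a ticker so Collect() never blocks;
// Collect() only reads the most recent observation. Purely illustrative.
type latencyMeasurer struct {
	mu       sync.Mutex
	lastSend float64
	lastRecv float64
}

func (m *latencyMeasurer) run(ctx context.Context, p Pubsub, channel string, every time.Duration) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			mctx, cancel := context.WithTimeout(ctx, 10*time.Second)
			send, recv, err := measureLatency(mctx, p, channel)
			cancel()
			if err != nil {
				continue // an error counter would be incremented here
			}
			m.mu.Lock()
			m.lastSend, m.lastRecv = send, recv
			m.mu.Unlock()
		}
	}
}

// last is what Collect() would read; note the value can be up to one tick
// stale, which is the staleness concern raised below.
func (m *latencyMeasurer) last() (send, recv float64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.lastSend, m.lastRecv
}
```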

Member

I like the idea of separating measurement in the background, but in practice you need to be careful you don't introduce goleak test flakes.

Contributor (Author)

The only hesitation I have with background measurement is that scrapes will be showing stale data, which may give operators the wrong impression.

Contributor

This architecture should be fine with multiple collections, as the unique ping will just get ignored by other calls to Collect().

Contributor (Author)

OK, let's give it a try: a7c042f

Contributor (Author)

I decided to double back to synchronous-only collection: 33a2a1d
The complexity that background collection introduced wasn't worth the benefit; I've satisfied my curiosity now 😅

)

cancel, err := p.Subscribe(latencyChannelName(), func(ctx context.Context, _ []byte) {
res <- time.Since(start).Seconds()
Contributor

Having each measurement be an identical message (and ignoring its content) could lead to previously timed-out measurements getting processed here. That could lead to very strange results, such as a ping arriving before start is set, which would result in the recv latency being two millennia 🙀

Contributor (Author)

Good call! OK, I'll refactor to generate a unique message and only mark the latency roundtrip done when I observe that.
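
A minimal sketch of that change, again against the hypothetical Pubsub interface from the earlier sketch (uuid here would be github.com/google/uuid; identifiers are illustrative, not the PR's final code):

```go
// measureWithUniquePing publishes a unique payload and only completes the
// round trip when that exact payload comes back, so stale pings from
// earlier, timed-out measurements are ignored.
func measureWithUniquePing(ctx context.Context, p Pubsub, channel string) (float64, error) {
	id := uuid.New().String()
	res := make(chan float64, 1)

	start := time.Now()
	cancel, err := p.Subscribe(channel, func(_ context.Context, msg []byte) {
		if string(msg) != id {
			return // a ping from a previous measurement; not ours
		}
		select {
		case res <- time.Since(start).Seconds():
		default:
		}
	})
	if err != nil {
		return 0, err
	}
	defer cancel()

	if err := p.Publish(channel, []byte(id)); err != nil {
		return 0, err
	}

	select {
	case recv := <-res:
		return recv, nil
	case <-ctx.Done():
		return 0, ctx.Err()
	}
}
```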

Contributor (Author)

Let me know what you think of 65f57b1 please

Contributor (Author)

49d2002 supersedes the above


// Force fast publisher to start after the slow one to avoid flakiness.
<-hold
time.Sleep(time.Millisecond * 50)
Contributor (Author)

YOLO-ed this a bit because Go's scheduler gives a goroutine at most 20ms on CPU before it is pre-empted; I added a bit of extra time to be sure. I ran this test with -count=100 and didn't observe any flakes, but I'm open to better approaches.

Member

This is probably OK for now, although the scheduler latency on Windows can apparently vary.

The approach I like to use for controlling time is to pass in a chan time.Time. Any time something tries to determine the current time, it reads from that channel instead.
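
A self-contained sketch of that pattern (names are illustrative, not from this PR): every request for the current time blocks on a channel the test controls, so elapsed durations are fully deterministic.

```go
package pubsub_test

import (
	"testing"
	"time"
)

// testClock hands out timestamps supplied by the test; each Now() call reads
// one value from the channel.
type testClock struct{ now chan time.Time }

func (c *testClock) Now() time.Time { return <-c.now }

// elapsed is a stand-in for code under test that records a start and end time.
func elapsed(c *testClock) time.Duration {
	start := c.Now()
	// ... the work being measured would happen here ...
	return c.Now().Sub(start)
}

func TestControlledTime(t *testing.T) {
	c := &testClock{now: make(chan time.Time, 2)}
	base := time.Date(2024, 5, 1, 0, 0, 0, 0, time.UTC)
	c.now <- base                             // "start" timestamp
	c.now <- base.Add(50 * time.Millisecond)  // "end" timestamp

	if got := elapsed(c); got != 50*time.Millisecond {
		t.Fatalf("want 50ms, got %v", got)
	}
}
```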

Contributor

This feels too brittle to me. Even if you ran it 100 times locally, we've observed really big variances in scheduling on CI.

I think you should simplify and fake the PubSub, not wrap it with a delay. Then you only need a single measurer, and you send it a bogus ping prior to sending the real one it published. Added bonus: it will execute in microseconds, not a second.
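
A sketch of what such a fake might look like, satisfying the hypothetical Pubsub interface from the earlier sketches (illustrative only; the PR's tests may differ). Publish delivers synchronously, so a test can feed the measurer a bogus ping first, then the real one, and everything completes in microseconds:

```go
// memPubsub is an illustrative in-memory fake: no Postgres, no network;
// Publish invokes every listener on the channel synchronously.
type memPubsub struct {
	mu        sync.Mutex
	listeners map[string][]func(ctx context.Context, message []byte)
}

func newMemPubsub() *memPubsub {
	return &memPubsub{listeners: make(map[string][]func(context.Context, []byte))}
}

func (m *memPubsub) Subscribe(channel string, listener func(ctx context.Context, message []byte)) (func(), error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.listeners[channel] = append(m.listeners[channel], listener)
	return func() {}, nil // unsubscribe elided for brevity
}

func (m *memPubsub) Publish(channel string, message []byte) error {
	m.mu.Lock()
	listeners := append([]func(context.Context, []byte){}, m.listeners[channel]...)
	m.mu.Unlock()
	for _, listener := range listeners {
		listener(context.Background(), message) // synchronous, deterministic delivery
	}
	return nil
}
```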

@dannykopping marked this pull request as ready for review on May 3, 2024 11:22

Member

@johnstcn left a comment

We should also update docs/admin/prometheus.md with the added metrics. LGTM once that's done.

@dannykopping
Contributor (Author)

We should also update docs/admin/prometheus.md with the added metrics. LGTM once that's done.

Added #13223 to track this; seems like the metrics aren't detected automatically.


Contributor

@spikecurtis left a comment

Minor suggestion inline, but I don't need to review again

@@ -528,6 +566,20 @@ func (p *PGPubsub) Collect(metrics chan<- prometheus.Metric) {
p.qMu.Unlock()
metrics <- prometheus.MustNewConstMetric(currentSubscribersDesc, prometheus.GaugeValue, float64(subs))
metrics <- prometheus.MustNewConstMetric(currentEventsDesc, prometheus.GaugeValue, float64(events))

// additional metrics
ctx, cancel := context.WithTimeout(context.Background(), LatencyMeasureInterval)
Contributor

Small quibble with the name: "interval" sounds to me like a periodic thing, so I would suggest LatencyMeasureTimeout

Contributor (Author)

Good call
Thanks for the patience on this PR!

@dannykopping enabled auto-merge (squash) on May 10, 2024 12:20
@dannykopping merged commit 4671ebb into coder:main on May 10, 2024
@github-actions bot locked and limited conversation to collaborators on May 10, 2024
@dannykopping deleted the dk/pubsub-latency branch on May 10, 2024 12:41
Reviewers

@johnstcn left review comments

@spikecurtis approved these changes

Awaiting requested review from @mafredri

Assignees

@dannykopping

3 participants
@dannykopping, @johnstcn, @spikecurtis
