feat: measure pubsub latencies and expose metrics #13126
Conversation
Signed-off-by: Danny Kopping <danny@coder.com>
coderd/database/pubsub/pubsub.go Outdated
@@ -528,6 +555,19 @@ func (p *PGPubsub) Collect(metrics chan<- prometheus.Metric) {
	p.qMu.Unlock()
	metrics <- prometheus.MustNewConstMetric(currentSubscribersDesc, prometheus.GaugeValue, float64(subs))
	metrics <- prometheus.MustNewConstMetric(currentEventsDesc, prometheus.GaugeValue, float64(events))
	// additional metrics
	ctx, cancel := context.WithTimeout(context.Background(), time.Second*10)
I worry about running a measurement with a 10s timeout from within Collect(), which typically gets called every 15s or so.
I settled on that number because a) it's sufficiently high to indicate a serious problem, b) it won't disrupt the default 1m scrape interval (source), and c) even with a low interval like 15s it still will not interfere with scrapes.
We could lower this a bit, though; what do you think?
I guess, what's the expectation for how quickly Collect() returns?
// This method may be called concurrently and must therefore be
// implemented in a concurrency safe way. Blocking occurs at the expense
// of total performance of rendering all registered metrics. Ideally,
// Collector implementations support concurrent readers.
If NOTIFY is hobbled, then we're blocking Collect() for up to 10s, which doesn't seem like we're playing nice.
I think it's acceptable to have Collect() block while we measure the latency, provided we don't overrun a reasonable scrape interval. In most architectures it's rare to have multiple scrapers scraping the same targets, so Collect() should only ever be called once at a time.
Alternatively, we could spawn a background process which measures latency periodically and Collect() simply reads its state.
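For illustration, here is a minimal sketch of the blocking variant under discussion. The Pubsub interface shown is a stand-in and the measureLatency helper is a hypothetical name, not the PR's actual code:

```go
package pubsub

import (
	"context"
	"time"
)

// Pubsub is a minimal stand-in for the real pubsub interface.
type Pubsub interface {
	Subscribe(event string, listener func(ctx context.Context, message []byte)) (cancel func(), err error)
	Publish(event string, message []byte) error
}

// measureLatency (hypothetical) performs one blocking round trip: subscribe,
// publish a ping, and wait for the ping to arrive or for ctx to expire.
func measureLatency(ctx context.Context, ps Pubsub, channel string) (send, recv time.Duration, err error) {
	recvCh := make(chan time.Duration, 1)

	start := time.Now()
	cancel, err := ps.Subscribe(channel, func(_ context.Context, _ []byte) {
		select {
		case recvCh <- time.Since(start):
		default: // already reported for this measurement; drop.
		}
	})
	if err != nil {
		return 0, 0, err
	}
	defer cancel()

	if err := ps.Publish(channel, []byte("ping")); err != nil {
		return 0, 0, err
	}
	send = time.Since(start)

	select {
	case recv = <-recvCh:
		return send, recv, nil
	case <-ctx.Done():
		// NOTIFY is hobbled or slow; give up rather than blocking the scrape indefinitely.
		return send, 0, ctx.Err()
	}
}
```

Called from Collect() with the 10s context.WithTimeout shown in the diff, a scrape blocks for at most that timeout; the background alternative would run this measurement in a goroutine and have Collect() simply read the last recorded values.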
I like the idea of separating measurement in the background, but in practice you need to be careful you don't introduce goleak test flakes.
The only hesitation I have with background measurement is that scrapes will be showing stale data, which may give operators the wrong impression.
This architecture should be fine with multiple collections, as the unique ping will just get ignored by other calls to Collect().
OK, let's give it a try: a7c042f
I decided to double back to synchronous-only collection: 33a2a1d
The complexity that background collection introduced wasn't worth the benefit; I've satisfied my curiosity now 😅
coderd/database/pubsub/latency.go Outdated
)
cancel, err := p.Subscribe(latencyChannelName(), func(ctx context.Context, _ []byte) {
	res <- time.Since(start).Seconds()
having each measurement be an identical message (and ignoring it) could lead to previously timed-out measurements getting processed here. That could lead to very strange results, such as a ping arriving before start is set, which will result in the recv latency being two millennia 🙀
Good call! OK, I'll refactor to generate a unique message and only mark the latency roundtrip done when I observe that.
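For illustration, a minimal sketch of that unique-message approach, reusing the hypothetical Pubsub interface from the earlier sketch (the actual change landed in 65f57b1/49d2002 and may differ):

```go
// Extends the earlier sketch; additional imports:
import (
	"bytes"
	"context"
	"time"

	"github.com/google/uuid"
)

// uniquePingRoundTrip (hypothetical) publishes a unique payload and only
// completes when that exact payload comes back, so a stale ping left over
// from a previous, timed-out measurement is ignored instead of producing a
// nonsensical latency.
func uniquePingRoundTrip(ctx context.Context, ps Pubsub, channel string) (time.Duration, error) {
	ping := []byte(uuid.NewString())
	done := make(chan time.Duration, 1)

	start := time.Now()
	cancel, err := ps.Subscribe(channel, func(_ context.Context, msg []byte) {
		if !bytes.Equal(msg, ping) {
			return // ping from an earlier measurement; drop it.
		}
		select {
		case done <- time.Since(start):
		default:
		}
	})
	if err != nil {
		return 0, err
	}
	defer cancel()

	if err := ps.Publish(channel, ping); err != nil {
		return 0, err
	}

	select {
	case d := <-done:
		return d, nil
	case <-ctx.Done():
		return 0, ctx.Err()
	}
}
```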
Let me know what you think of 65f57b1, please.
49d2002 supersedes the above
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
// Force fast publisher to start after the slow one to avoid flakiness.
<-hold
time.Sleep(time.Millisecond * 50)
YOLO-ed this a bit because Go's scheduler gives a goroutine at most 20ms on CPU before it is pre-empted; I added a bit of extra time to be sure. I ran this test with -count=100 and didn't observe any flakes, but I'm open to better approaches.
This is probably OK for now, although the scheduler latency on Windows can apparently vary.
The approach I like to use for controlling time is to pass in a chan time.Time. Any time something needs the current time, it then comes from reading that channel.
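A minimal sketch of that pattern, with hypothetical names (this is the reviewer's general technique, not code from this PR): the code under test gets its notion of "now" from a channel, so the test decides exactly which timestamps are observed and when.

```go
package pubsub_test

import (
	"testing"
	"time"
)

// newChanClock returns a now() function backed by a channel of times; tests
// push timestamps onto the channel to control exactly what the code sees.
func newChanClock(times <-chan time.Time) func() time.Time {
	return func() time.Time { return <-times }
}

// elapsedSince measures a duration using the injected clock rather than the
// wall clock, so the test is immune to scheduler latency.
func elapsedSince(now func() time.Time) func() time.Duration {
	start := now()
	return func() time.Duration { return now().Sub(start) }
}

func TestControlledTime(t *testing.T) {
	t.Parallel()

	times := make(chan time.Time, 2)
	base := time.Date(2024, 5, 1, 0, 0, 0, 0, time.UTC)
	times <- base                             // observed when the measurement starts
	times <- base.Add(250 * time.Millisecond) // observed when it finishes

	stop := elapsedSince(newChanClock(times))
	if got := stop(); got != 250*time.Millisecond {
		t.Fatalf("expected 250ms, got %s", got)
	}
}
```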
This feels too brittle to me. Even if you ran it 100 times locally, we've observed really big variances in scheduling on CI.
I think you should simplify and fake the PubSub, not wrap it with a delay. Then you only need a single measurer, and you send it a bogus ping before the real one it publishes. An added bonus is that it will execute in microseconds, not a second.
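For illustration, a minimal sketch of such a fake (names are hypothetical; the PR's eventual test may look different). Publish delivers a bogus ping and then the real message synchronously to each listener, exercising the measurer's "ignore unexpected pings" path without sleeps or a real Postgres:

```go
package pubsub_test

import "context"

// fakePubsub is an in-memory stand-in: Subscribe records listeners and
// Publish invokes them synchronously on the calling goroutine. Not safe for
// concurrent use; it exists only to make the test deterministic.
type fakePubsub struct {
	listeners map[string][]func(ctx context.Context, message []byte)
}

func newFakePubsub() *fakePubsub {
	return &fakePubsub{listeners: map[string][]func(context.Context, []byte){}}
}

func (f *fakePubsub) Subscribe(event string, listener func(ctx context.Context, message []byte)) (func(), error) {
	f.listeners[event] = append(f.listeners[event], listener)
	return func() {}, nil // cancel is a no-op in this sketch.
}

// Publish first delivers a bogus ping to every listener and then the real
// message, so the measurer must discard the stale ping before completing.
func (f *fakePubsub) Publish(event string, message []byte) error {
	for _, l := range f.listeners[event] {
		l(context.Background(), []byte("bogus-ping")) // stale/foreign ping
		l(context.Background(), message)              // the real one
	}
	return nil
}
```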
Signed-off-by: Danny Kopping <danny@coder.com>
Refactor async measurement for immediate exit upon signal
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
We should also update docs/admin/prometheus.md with the added metrics. LGTM once that's done.
…orth the complexity
Signed-off-by: Danny Kopping <danny@coder.com>
Added #13223 to track this; seems like the metrics aren't detected automatically.
Minor suggestion inline, but I don't need to review again
coderd/database/pubsub/pubsub.go Outdated
@@ -528,6 +566,20 @@ func (p *PGPubsub) Collect(metrics chan<- prometheus.Metric) {
	p.qMu.Unlock()
	metrics <- prometheus.MustNewConstMetric(currentSubscribersDesc, prometheus.GaugeValue, float64(subs))
	metrics <- prometheus.MustNewConstMetric(currentEventsDesc, prometheus.GaugeValue, float64(events))
	// additional metrics
	ctx, cancel := context.WithTimeout(context.Background(), LatencyMeasureInterval)
Small quibble with the name: "interval" sounds to me like a periodic thing, so I would suggest LatencyMeasureTimeout.
Good call
Thanks for the patience on this PR!
Signed-off-by: Danny Kopping <danny@coder.com>
Pubsub is a crucial aspect of Coder's design. We use Postgres LISTEN/NOTIFY for this purpose, but the problem is that this subsystem of Postgres is not very observable. At most, one can observe how full its buffers are. This PR adds 3 metrics:

- coder_pubsub_send_latency_seconds
- coder_pubsub_receive_latency_seconds
- coder_pubsub_latency_measure_errs_total

These metrics are tracked on each coderd replica, and each replica will use its own notification channel.

Ideally we want to isolate Postgres' contribution to the latency as much as possible, but since there's a network in between this won't be very accurate; however, these metrics will still be Good Enough™️ to alert on and give operators an indication of why coderd appears to be slowing down.

Operators can use these metrics in conjunction with the rate of change in coder_pubsub_{publishes,subscribes}_total to infer whether the queue has become overly large and/or the receivers are too bogged down to alleviate pressure on the pubsub queue. If Postgres is under-resourced on CPU, this might also be a contributing factor.

TODO: add this as a healthcheck in a subsequent PR.
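For reference, a hedged sketch of how metrics with these names could be exposed from a custom Collector; the descriptor help strings and the emit helper are assumptions for this example, while the real PR wires its metrics into PGPubsub's existing Collect():

```go
package pubsub

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical descriptors matching the metric names listed above.
var (
	pubsubSendLatencyDesc = prometheus.NewDesc(
		"coder_pubsub_send_latency_seconds",
		"Time taken to publish a measurement ping to the pubsub.",
		nil, nil,
	)
	pubsubRecvLatencyDesc = prometheus.NewDesc(
		"coder_pubsub_receive_latency_seconds",
		"Time taken for a published measurement ping to be received.",
		nil, nil,
	)
	pubsubLatencyMeasureErrsDesc = prometheus.NewDesc(
		"coder_pubsub_latency_measure_errs_total",
		"Total number of failed latency measurements.",
		nil, nil,
	)
)

// emitLatencyMetrics sketches how a Collect implementation might report one
// measurement: gauges for the last observed latencies, a counter for errors.
func emitLatencyMetrics(metrics chan<- prometheus.Metric, send, recv time.Duration, errCount float64) {
	metrics <- prometheus.MustNewConstMetric(pubsubSendLatencyDesc, prometheus.GaugeValue, send.Seconds())
	metrics <- prometheus.MustNewConstMetric(pubsubRecvLatencyDesc, prometheus.GaugeValue, recv.Seconds())
	metrics <- prometheus.MustNewConstMetric(pubsubLatencyMeasureErrsDesc, prometheus.CounterValue, errCount)
}
```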