The monitorStream function can block for a long time when creating and
installing a snapshot of a stream's state. This can lead to increased
tail latency.
This commit extends the RaftNode interface with a new InstallSnapshotAsync
method. This method performs snapshot writing and WAL compaction in a
in a separate goroutine, making the process non-blocking. The existing
InstallSnapshot method is now a synchronous wrapper around the new
asynchronous implementation.

Signed-off-by: Daniele Sciasciadaniele@nats.io

Copy link

ContributorAuthor

sciascid commentedOct 30, 2025•
edited
Loading

This pull request tries to reduce the (tail) latency of publishing to a stream.
Below are results of publishing to a single stream, on a 3 node cluster deployed on a single machine. The table shows throughput in msg/s and various latency measures (min, 50th percentile, 90th percentile and so on) in microseconds. In each experiment the client runs for 60 seconds.

Under low load, single client publishing one message at a time, latency varies between 61 microseconds, up to ~20 milliseconds for unlucky messages. Under high load, single client publishing 500 messages at a time, latency goes from ~1ms all the way to ~40ms.

batch	throughput	min	p50	p90	p99	p99.9	max
1	10131	61	96	109	131	208	20364
500	284923	1003	1456	1969	13688	21280	39995

One cause for the high tail latency is due to periodic snapshotting of the stream. Taking a stream snapshot involves writing a file and sync it to disk, and doing it in a safe way requires more than one sync call (710407c).
During this time, the monitorStream goroutine is blocked and can't process incoming messages. The above patch "moves" the writing and syncing to disk, off the stream's main thread. By doing so tale latency improves significantly:

batch	throughput	min	p50	p90	p99	p99.9	max
1	10238	58	95	109	130	213	7654
500	313891	1073	1484	1990	3434	4653	13250

In addition, under high load, there was ~10% improvement in throughput.

I used the "lats" benchmark client:https://github.com/synadia-labs/lats

MauriceVanVeen reviewed

Oct 30, 2025

View reviewed changes

server/raft.go OutdatedShow resolvedHide resolved

server/jetstream_cluster.go OutdatedShow resolvedHide resolved

server/jetstream_cluster.goShow resolvedHide resolved

sciascid force-pushed theraft-two-step-snapshot branch 7 times, most recently fromf55eda9 toa7e89d1Compare

November 3, 2025 14:50

sciascid marked this pull request as ready for review

November 3, 2025 14:52

sciascid requested a review froma team as acode owner

November 3, 2025 14:52

MauriceVanVeen reviewed

Nov 4, 2025

View reviewed changes

Copy link

Member

MauriceVanVeen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Could you add a couple tests, for example toraft_test, to cover the use ofInstallSnapshotAsync as well as whenInstallSnapshot is called when already snapshotting?

server/raft.goShow resolvedHide resolved

server/raft.go

		// InstallSnapshot installs a snapshot for the current applied index. This is a
		// synchronous call that will block until the snapshot is installed, and will
		// release all of the log entries up to the applied index.
		func (n*raft)InstallSnapshot(data []byte)error {

Copy link

Member

MauriceVanVeenNov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Does the performance change due to this now creating both a newch for every call and spinning up a goroutine? Specifically for the paths that don't use async snapshots like meta and consumer.

Copy link

ContributorAuthor

sciascidNov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Good question. I did it this way because the goroutine that is spawned acquires a lock on raft only after it wrote the snapshot file. My thinking was that we can we can get some of the benefits also when calling InstallSnapshot. And now that I think about it, it might work very nicely if combined with some of the optimizations in#7355.
I believe there is not too much overhead in spawning a goroutine and making the channel, compared to what the rest of the method is doing (create the snapshot file + a few syncs + compact). I could measure the overhead by changing doSnapshotAsync call with doSnapshot calls, and run the same benchmark and compare it to baseline.
We could also consider to use InstallSnapshotAsync with meta and consumers. I haven't done so because I don't have a way to benchmark those yet.

Copy link

ContributorAuthor

sciascidNov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I tried to evaluate the overhead as explained above. Basically, inmonitorStream I replaceddoSnapshotAsync withdoSnapshot. I got the following results:
overhead.pdf

As expected, overhead is small and sometimes we do get some benefits in terms of latency. Alternatively, we could keep a goroutine ready to go, linked to a IPQueue.
Or if we want to err on the safe side, I could change the patch so that nothing changes in the case of regular InstallSnapshot

Copy link

Member

MauriceVanVeenNov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Think it's fine for the scope of this PR. But maybe indeed we'll need to do some more profiling (also including the other perf improvements for 2.14) and see whether we should keep a goroutine ready to go and link to a IPQueue like you mention.

Copy link

ContributorAuthor

sciascid commentedNov 4, 2025

I was a little skeptical of the throughput improvement, and in fact I think that was partially due to the experiment running on a single machine: three servers saturate my cores, and no latency between the servers. So I did some experiments on EC2. Three servers deployed on eu-central-1, each one in different AZs. I repeated the experiments with a batched client and a pipelined client, with and without this optimization. In this setting, there still is a sharp reduction in tail latency, but only a slight increase (if any) in throughput.

batch.pdf
pipeline.pdf

sciascid force-pushed theraft-two-step-snapshot branch 3 times, most recently from2c3a58e to16a5caaCompare

November 5, 2025 15:06

MauriceVanVeen approved these changes

Nov 5, 2025

View reviewed changes

Copy link

Member

MauriceVanVeen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

LGTM

Marked this PR for 2.14, assuming we want to wait and bundle this improvement with the other planned Raft improvements?

server/raft.go

		// InstallSnapshot installs a snapshot for the current applied index. This is a
		// synchronous call that will block until the snapshot is installed, and will
		// release all of the log entries up to the applied index.
		func (n*raft)InstallSnapshot(data []byte)error {

Copy link

Member

MauriceVanVeenNov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

MauriceVanVeen changed the title~~Introduce asynchronous file writing for stream snapshots~~(2.14) Introduce asynchronous file writing for stream snapshots

Nov 5, 2025

sciascid force-pushed theraft-two-step-snapshot branch 2 times, most recently from0562a89 toc6657a4Compare

November 5, 2025 15:21

neilalexander reviewed

Nov 13, 2025

View reviewed changes

server/jetstream_cluster.goShow resolvedHide resolved

server/jetstream_cluster.go

		// This shouldn't happen for streams like it can for pull
		// consumers on idle streams but better to be safe than sorry!
		ne,nb:=n.Size()
		ifcurState==lastState&&ne<compactNumMin&&nb<compactSizeMin {

Copy link

Member

neilalexanderNov 13, 2025•
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I'm wondering if this does the right thing, given that there's also thefirstNeedsUpdate andlastNeedsUpdate bools. If either of those are true then theFirst andLast may be untrustworthy for this comparison. My feeling is that they shouldn't be set for a filtered state of_EMPTY_ but possible to double check?

Copy link

ContributorAuthor

sciascidNov 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I tried to make sure that the optimization in this patch makes no changes in the logic of when to create a snapshot. And I also verified that under the same workload, this patch will roughly result in the same number of snapshot operations as the original code.
Having said that, I do think that checking the store's state is unnecessary. And I did play with changing these conditions a bit. However, I'd prefer to do this in a separate patch.

Copy link

ContributorAuthor

sciascidNov 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Just elaborating my previous comment. The primary reason for taking snapshots of the stream is to trim raft's log. The logic that is implemented here involves the size and number of entries in the log, and this curState and lastState comparison. But this does not tell us anything about how much of raft's log we can get rid of. A better strategy would be the following: we compact if we can get rid of least one entire block. And we should not bother compacting partial blocks during normal operation.

server/jetstream_cluster.goShow resolvedHide resolved

server/jetstream_cluster.go OutdatedShow resolvedHide resolved

sciascid force-pushed theraft-two-step-snapshot branch fromc6657a4 to455a925Compare

November 14, 2025 10:35

Make stream snapshots asynchronous

4d07c51

The monitorStream function can block for a long time when creating andinstalling a snapshot of a stream's state. This can lead to increasedtail latency.This commit extends the RaftNode interface with a new InstallSnapshotAsyncmethod. This method performs snapshot writing and WAL compaction in ain a separate goroutine, making the process non-blocking. The existingInstallSnapshot method is now a synchronous wrapper around the newasynchronous implementation.Signed-off-by: Daniele Sciascia <daniele@nats.io>

sciascid force-pushed theraft-two-step-snapshot branch from455a925 to4d07c51Compare

November 14, 2025 10:36

Labels

None yet

Movatterモバイル変換

Uh oh!

(2.14) Introduce asynchronous file writing for stream snapshots#7492

Are you sure you want to change the base?

(2.14) Introduce asynchronous file writing for stream snapshots#7492

Uh oh!

Conversation

sciascid commentedOct 30, 2025• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

sciascid commentedOct 30, 2025• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

MauriceVanVeen left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

MauriceVanVeenNov 4, 2025

Choose a reason for hiding this comment

Uh oh!

sciascidNov 4, 2025

Choose a reason for hiding this comment

Uh oh!

sciascidNov 4, 2025

Choose a reason for hiding this comment

Uh oh!

MauriceVanVeenNov 5, 2025

Choose a reason for hiding this comment

Uh oh!

sciascid commentedNov 4, 2025

Uh oh!

MauriceVanVeen left a comment

Choose a reason for hiding this comment

Uh oh!

MauriceVanVeenNov 5, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

neilalexanderNov 13, 2025• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sciascidNov 14, 2025

Choose a reason for hiding this comment

Uh oh!

sciascidNov 14, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

sciascid commentedOct 30, 2025•
edited
Loading

sciascid commentedOct 30, 2025•
edited
Loading

neilalexanderNov 13, 2025•
edited
Loading