[Other] Shrinking QQs of a partitioned cluster member can be slow #15057

Unanswered
the-mikedavis asked this question in Other

Community Support Policy

RabbitMQ version used

4.2.x

How is RabbitMQ deployed?

Other

Steps to reproduce the behavior in question

When a cluster member is lost due to a hardware failure, rabbitmqctl forget_cluster_node (or other QQ shrink actions) can work through the set of QQs slowly. A hardware failure may cause a node to 'disappear' from the point of view of the other nodes in the cluster, and forgetting the lost member can then take a long time per queue. forget_cluster_node is an easy way to reproduce this but not the only one: rabbit_quorum_queue:shrink_all/1 is also used by peer discovery cleanup (an opt-in feature), and QQ continuous membership reconciliation (QQCMR) works through the queues in a similar way with rabbit_quorum_queue:delete_member/2.

Reproduction

I have three EC2 nodes with service tags set to rabbitmq, sharing an Erlang cookie and using this config:

# ~/config.conf
cluster_formation.peer_discovery_backend = aws
cluster_formation.aws.region = <region>
cluster_formation.aws.access_key_id = <access key>
cluster_formation.aws.secret_key = <secret key>
cluster_formation.aws.instance_tags.service = rabbitmq

Then on each node I run make RABBITMQ_CONFIG_FILE=~/config.conf run-broker using main and OTP 27, and declare a large number of queues:

perf-test -qq -qpf 1 -qpt 1000 -qp qq-%d -x 1 -y 0 --time 1

Then we simulate a hardware failure in which a node effectively becomes partitioned at the network level using iptables.

# block traffic to/from the other brokers (node C blocking A and B here):
A=172.31.30.253
B=172.31.17.251
iptables -A INPUT -s $A -j DROP && iptables -A OUTPUT -d $A -j DROP && \
    iptables -A INPUT -s $B -j DROP && iptables -A OUTPUT -d $B -j DROP

If there are any QQs with leaders on that node, the majority side of the cluster soon takes over leadership. Nodes A and B then recognize that C is unreachable:

2025-12-03 16:12:29.936467+00:00 [error] <0.382.0> ** Node 'rabbit@ip-172-31-26-76' not responding **
2025-12-03 16:12:29.936467+00:00 [error] <0.382.0> ** Removing (timedout) connection **
2025-12-03 16:12:29.936467+00:00 [error] <0.382.0>
2025-12-03 16:12:29.939262+00:00 [info] <0.514.0> rabbit on node 'rabbit@ip-172-31-26-76' down
2025-12-03 16:12:29.941314+00:00 [info] <0.514.0> node 'rabbit@ip-172-31-26-76' down: net_tick_timeout

Now if we run rabbitmqctl forget_cluster_node rabbit@ip-172-31-26-76 from A or B, we can see that C is removed from each QQ rather slowly:

2025-12-03 16:13:16.958658+00:00 [info] <0.114280.0> Will remove all queues from node rabbit@ip-172-31-26-76. The node is likely being removed from the cluster.
2025-12-03 16:13:16.960172+00:00 [info] <0.119227.0> Asked to remove all quorum queue replicas from node rabbit@ip-172-31-26-76
2025-12-03 16:13:16.961481+00:00 [info] <0.119227.0> queue 'qq-71' in vhost '/': removing member (replica) on node 'rabbit@ip-172-31-26-76'
2025-12-03 16:13:23.965503+00:00 [info] <0.119227.0> queue 'qq-335' in vhost '/': removing member (replica) on node 'rabbit@ip-172-31-26-76'
2025-12-03 16:13:30.970583+00:00 [info] <0.119227.0> queue 'qq-476' in vhost '/': removing member (replica) on node 'rabbit@ip-172-31-26-76'
2025-12-03 16:13:37.976568+00:00 [info] <0.119227.0> queue 'qq-411' in vhost '/': removing member (replica) on node 'rabbit@ip-172-31-26-76'
2025-12-03 16:13:44.981565+00:00 [info] <0.119227.0> queue 'qq-610' in vhost '/': removing member (replica) on node 'rabbit@ip-172-31-26-76'
2025-12-03 16:13:51.986743+00:00 [info] <0.119227.0> queue 'qq-736' in vhost '/': removing member (replica) on node 'rabbit@ip-172-31-26-76'
2025-12-03 16:13:58.991849+00:00 [info] <0.119227.0> queue 'qq-566' in vhost '/': removing member (replica) on node 'rabbit@ip-172-31-26-76'
2025-12-03 16:14:05.996656+00:00 [info] <0.119227.0> queue 'qq-244' in vhost '/': removing member (replica) on node 'rabbit@ip-172-31-26-76'

Note the seven seconds between each attempted queue membership removal. Leaving this test to run overnight, it takes around 1 hr 57 min to finish shrinking the member off of the 1000 queues (roughly 1000 queues × 7 s ≈ 7000 s).

Analysis

rabbit_quorum_queue:delete_member/2 has three potentially expensive components (sketched below):

  1. ra:remove_member/3. This acts against the (maybe newly elected) leader, though, and is usually very quick.
  2. rabbit_amqqueue:update/2 to remove the node from the members list in the queue type state. This is also quick, since the first step of forget_cluster_node is to remove the node from the metadata-store membership.
  3. ra:force_delete_server/2 once ra:remove_member/3 succeeds. This is where the seven seconds come from.
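
As a rough sketch of that per-queue work (helper names such as server_id/2, leader_id/1 and remove_node_from_members/2, and the ra system name, are placeholders; this is not the actual rabbit_quorum_queue source):

delete_member_sketch(Q, Node) ->
    ServerId = server_id(Q, Node),
    %% 1. ask the current (possibly newly elected) leader to drop the member;
    %%    this consensus command is usually very quick
    {ok, _, _Leader} = ra:remove_member(leader_id(Q), ServerId),
    %% 2. drop the node from the members list in the queue type state; also
    %%    quick because the node was already removed from the metadata store
    rabbit_amqqueue:update(amqqueue:get_name(Q),
                           fun(Q1) -> remove_node_from_members(Q1, Node) end),
    %% 3. stop and delete the ra server on the lost node; the rpc:call/4 behind
    %%    this is what can wait for the full distribution connect timeout
    ra:force_delete_server(quorum_queues, ServerId).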

ra:force_delete_server/2 ultimately calls ra_server_sup_sup:stop_server/2, which performs an rpc:call/4 to the failed node. Because the node disappeared at the network level, Erlang forgets about its distribution table entry and attempts to form a new connection to the node. By default the connect timeout is seven seconds in net_kernel, so each of these calls waits for up to seven seconds when the destination is unreachable and not responding. Shrinking is usually very fast, completing in single-digit minutes even for thousands of queues, as long as the lost member hung up the network connection gracefully. But when the member disappears via net_tick_timeout rather than connection_down, shrinking is slow.
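
For reference, the seven seconds is the default of the kernel application's net_setuptime parameter (a value in seconds, capped at 120). As a hedged aside that was not discussed in the thread: lowering it on the node performing the shrink should shorten each failed connection attempt, at the cost of making legitimate connection setup less tolerant of slow networks:

# assumption: net_setuptime is consulted for each new outgoing distribution
# connection attempt, so changing it at runtime affects subsequent attempts
rabbitmqctl eval 'application:set_env(kernel, net_setuptime, 3).'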

This situation is mostly benign since all QQs still have a quorum of active members. However, if there are other membership changes during this long window, such as a new instance joining, the quorum can become threatened: the membership increases to four nodes for any QQs still waiting to have the original node removed. If another instance then fails, the membership would drop to 2/4 active members on some QQs and progress on those QQs would be blocked.

Replies: 3 comments, 4 replies

the-mikedavis (Maintainer, Author), Dec 3, 2025

One quick workaround for this is to turn off distribution auto-connection on the node performing the deletion:

rabbitmqctl eval 'application:set_env(kernel, dist_auto_connect, never).'

Notice the point at 16:15:09 in this test where the option kicks in:

2025-12-03 16:14:34.016557+00:00 [info] <0.119227.0> queue 'qq-912' in vhost '/': removing member (replica) on node 'rabbit@ip-172-31-26-76'
2025-12-03 16:14:41.020594+00:00 [info] <0.119227.0> queue 'qq-757' in vhost '/': removing member (replica) on node 'rabbit@ip-172-31-26-76'
2025-12-03 16:14:48.025572+00:00 [info] <0.119227.0> queue 'qq-273' in vhost '/': removing member (replica) on node 'rabbit@ip-172-31-26-76'
2025-12-03 16:14:55.030541+00:00 [info] <0.119227.0> queue 'qq-59' in vhost '/': removing member (replica) on node 'rabbit@ip-172-31-26-76'
2025-12-03 16:15:02.035549+00:00 [info] <0.119227.0> queue 'qq-557' in vhost '/': removing member (replica) on node 'rabbit@ip-172-31-26-76'
2025-12-03 16:15:09.040579+00:00 [info] <0.119227.0> queue 'qq-171' in vhost '/': removing member (replica) on node 'rabbit@ip-172-31-26-76'
2025-12-03 16:15:09.044107+00:00 [info] <0.119227.0> queue 'qq-166' in vhost '/': removing member (replica) on node 'rabbit@ip-172-31-26-76'
2025-12-03 16:15:09.047423+00:00 [info] <0.119227.0> queue 'qq-429' in vhost '/': removing member (replica) on node 'rabbit@ip-172-31-26-76'
2025-12-03 16:15:09.050673+00:00 [info] <0.119227.0> queue 'qq-443' in vhost '/': removing member (replica) on node 'rabbit@ip-172-31-26-76'

Once the option is set, shrinking completes very quickly since the seven-second timeout is eliminated. The option is harmless to set temporarily on a node which is not attempting to join other nodes. New instances which launch while the option is set may still join this node, but the option prevents this node from joining others, so once the shrink is complete it's a good idea to unset it:

rabbitmqctl eval 'application:unset_env(kernel, dist_auto_connect).'
the-mikedavis (Maintainer, Author), Dec 3, 2025

I haven't looked very far into this, but it would be interesting if we could set dist_auto_connect to never and still have clustering work correctly. Instead of relying on automatic connections like rpc:call/4 or net_adm:ping/1, we would need to connect to new nodes explicitly with net_kernel:connect_node/1. I tried this on main with this block:

%% Ensure the remote node is reachable before we add it.
case net_adm:ping(RemoteNode) of
    pong ->

modified to use net_kernel:connect_node/1, but that isn't enough to make this work on its own. (To be investigated further.)
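
For concreteness, a minimal sketch of that substitution (the success-branch body is a placeholder, not the actual source):

%% connect explicitly instead of relying on the implicit auto-connect
%% performed by net_adm:ping/1; connect_node/1 returns true | false | ignored
case net_kernel:connect_node(RemoteNode) of
    true ->
        %% placeholder: continue with adding the remote node
        cluster_remote_node(RemoteNode);
    _ ->
        {error, {cannot_connect_to_node, RemoteNode}}
end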

the-mikedavis (Maintainer, Author), Dec 4, 2025

I looked more into this 'explicit connections' idea. Initial clustering works if we update rabbit_ff_controller:rpc_calls/5 to use net_kernel:connect_node/1, but using explicit connections means that partitioned nodes don't automatically reconnect when the partition resolves. They don't return to the cluster until they're manually joined back with rabbitmqctl join_cluster. So this doesn't seem very useful to me.

the-mikedavis (Maintainer, Author), Dec 4, 2025

The easiest change to make here might be to batch and parallelize rabbit_quorum_queue:shrink_all/1. Currently QQs are shrunk serially. rabbit_quorum_queue:delete_member/2 is normally fairly fast, but updating a QQ's membership and updating the metadata store synchronously, one queue at a time, is relatively slow anyway. Batching would cut down on the total shrink time considerably, depending on how many QQs are shrunk in each batch.
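
A minimal sketch of that idea, assuming rabbit_quorum_queue:delete_member/2 as described above; chunk/2, the batch size and the spawning strategy are illustrative only, not the actual change:

%% shrink queues off Node in parallel batches instead of strictly in serial;
%% each batch pays the (worst-case) connect timeout once, concurrently
shrink_in_batches(Node, Queues, BatchSize) ->
    lists:flatmap(
      fun(Batch) ->
              Parent = self(),
              Refs = [begin
                          Ref = make_ref(),
                          spawn_link(fun() ->
                                  Res = rabbit_quorum_queue:delete_member(Q, Node),
                                  Parent ! {Ref, Res}
                          end),
                          Ref
                      end || Q <- Batch],
              %% wait for the whole batch before starting the next one
              [receive {Ref, Res} -> Res end || Ref <- Refs]
      end, chunk(Queues, BatchSize)).

chunk([], _N) -> [];
chunk(L, N) when length(L) =< N -> [L];
chunk(L, N) -> {Batch, Rest} = lists:split(N, L), [Batch | chunk(Rest, N)].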

michaelklishin

This sounds reasonable. There should be no inter-dependencies between the queues in this case. Even with the shared metadata store, each queue uses a separate key.

This is definitely something worth trying.

the-mikedavis (Maintainer, Author), Dec 8, 2025

#15081 takes this route: we can chunk the list of queues and shrink the batches in parallel.

the-mikedavis (Maintainer, Author), Dec 8, 2025

@kjnilsson mentioned in #15081 (comment) that we could pipeline commands to make the WAL parts of this more efficient (rather than spawning many processes). We need some changes in Ra to add functions for making membership changes through pipelined commands (rabbitmq/ra#566). Then, once Mnesia is eliminated, we can use an async update in Khepri to update the queue type state in the metadata store. To cut down on the 7s timeout we could then perform the ra:force_delete_server/2 call for all queues in one RPC. And in the long run we could apply the same strategies to QQ growth (rabbit_quorum_queue:grow/5).
