[Other] Shrinking QQs of a partitioned cluster member can be slow #15057
-
RabbitMQ version used

4.2.x

How is RabbitMQ deployed?

Other

Steps to reproduce the behavior in question

When a cluster member is lost due to some hardware failures, …

Reproduction

I have three EC2 nodes with

```
# ~/config.conf
cluster_formation.peer_discovery_backend = aws
cluster_formation.aws.region = <region>
cluster_formation.aws.access_key_id = <access key>
cluster_formation.aws.secret_key = <secret key>
cluster_formation.aws.instance_tags.service = rabbitmq
```

Then on each node I run

Then we simulate a hardware failure where a node effectively becomes partitioned network-wise using iptables:

```
# block traffic to/from the other brokers (node C blocking A and B here):
A=172.31.30.253
B=172.31.17.251
iptables -A INPUT -s $A -j DROP && iptables -A OUTPUT -d $A -j DROP && \
  iptables -A INPUT -s $B -j DROP && iptables -A OUTPUT -d $B -j DROP
```

If there are any QQs with leaders on that node, the majority side of the cluster will take over leadership soon after. Then nodes A and B recognize that C is unreachable:

Now if we run

Note the seven seconds between each attempted queue membership removal. Leaving this test to run overnight, it takes around 1 hr 57 min to finish shrinking the member off of the 1000 queues.

Analysis
This situation is benign since all QQs still have a quorum of active members. However, if there are other membership changes during this long window, like a new instance joining, the quorum can become threatened: the membership increases to four nodes for any QQs still waiting to have the original node removed. If any instance then fails, the membership would drop to 2/4 active members on some QQs and progress on those QQs would be blocked.
Replies: 3 comments · 4 replies
-
One quick workaround for this is to turn off distribution auto-connection on the node performing the deletion:

Notice at 16:15:09 in this test when this option kicks in:

Once the option is set, shrinking completes very quickly since the seven-second timeout is eliminated.

That option is harmless to set temporarily on a node which is not attempting to join other nodes. New instances which launch while the option is set may still join this node, but the option prevents this node from joining others, so once the shrink is complete it's a good idea to unset it:
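For reference, a minimal sketch of what that toggle could look like, assuming the option referred to above is the Erlang kernel parameter `dist_auto_connect` (the original snippets are not reproduced in this thread); the expressions could be run on the shrinking node, e.g. via `rabbitmqctl eval`:

```erlang
%% Assumption: the "distribution auto-connection" option is the Erlang kernel
%% parameter dist_auto_connect. Setting it to `never` on the node performing
%% the shrink stops that node from auto-connecting to unreachable peers:
application:set_env(kernel, dist_auto_connect, never).

%% ...perform the shrink...

%% Restore the default behaviour once the shrink is complete:
application:unset_env(kernel, dist_auto_connect).
```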
-
I haven't looked very far into this but it would be interesting if we could set

rabbitmq-server/deps/rabbit/src/rabbit_khepri.erl, lines 552 to 554 in 9c56475

modified to use
-
I looked more into this 'explicit connections' idea. Initial clustering works if we update
-
The easiest change to make here might be to batch and parallelize
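To illustrate the batching idea (a rough sketch only, not the actual change that ended up in #15081): chunk the list of queues and run the per-queue member removal for each batch in parallel. `ShrinkOne` below is a stand-in for whatever per-queue removal call is in use, passed in so the sketch stays self-contained.

```erlang
%% Rough sketch of batched, parallel shrinking. ShrinkOne is the per-queue
%% member-removal call; BatchSize bounds how many removals run concurrently.
shrink_in_batches(Queues, BatchSize, ShrinkOne) ->
    lists:foreach(
      fun(Batch) ->
              %% start one process per queue in the batch...
              Workers = [spawn_monitor(fun() -> ShrinkOne(Q) end) || Q <- Batch],
              %% ...and wait for the whole batch before starting the next
              [receive {'DOWN', Ref, process, Pid, _Reason} -> ok end
               || {Pid, Ref} <- Workers]
      end,
      chunk(Queues, BatchSize)).

%% split a list into sublists of at most N elements
chunk([], _N) -> [];
chunk(L, N) when length(L) =< N -> [L];
chunk(L, N) ->
    {Head, Tail} = lists:split(N, L),
    [Head | chunk(Tail, N)].
```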
-
This sounds reasonable. There should be no inter-dependencies between the queues in this case. Even with the shared metadata store, each queue uses a separate key. This is definitely something worth trying.
-
#15081 takes this route: we can chunk the list of queues and shrink batches in parallel.
-
@kjnilsson mentioned in #15081 (comment) that we could pipeline commands to make the WAL parts of this more efficient (rather than spawning many processes). We need some changes in Ra to add functions for making membership changes through pipelined commands (rabbitmq/ra#566). Then, once Mnesia is eliminated, we can use an async update in Khepri to update the queue type state in the metadata store. To cut down on the 7s timeout we could then perform the
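As a very rough illustration of the pipelining pattern (the real API is what rabbitmq/ra#566 would add; `PipelineRemoval` below is a hypothetical stand-in for it): enqueue every membership change without waiting, then collect the applied notifications that Ra delivers as `ra_event` messages.

```erlang
%% Hypothetical sketch only. PipelineRemoval stands in for the non-blocking
%% membership-change function proposed in rabbitmq/ra#566. The point is the
%% shape of the pattern: pipeline all changes, then wait for {applied, ...}.
shrink_pipelined(ServerIds, PipelineRemoval) ->
    Pending = [begin
                   Corr = make_ref(),
                   ok = PipelineRemoval(ServerId, Corr),
                   Corr
               end || ServerId <- ServerIds],
    await_applied(Pending).

%% wait until every pipelined command reports as applied (or time out)
await_applied([]) ->
    ok;
await_applied(Pending) ->
    receive
        {ra_event, _LeaderId, {applied, Applied}} ->
            await_applied(Pending -- [Corr || {Corr, _Reply} <- Applied])
    after 30000 ->
            {error, {timeout, Pending}}
    end.
```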