basicPublish can freeze for a very long time on network interface removal #995

sebek64 started this conversation in General

  • RabbitMQ version: 3.9.21
  • Erlang version: 12.3.2.2
  • Client library version: 5.16.0
  • Operating system, version, and patch level: Linux, kernel 5.10.0
  • Java: openjdk version "17.0.5" 2022-10-18 LTS

The RabbitMQ client can freeze while writing to the socket when the network interface is removed. For example, we can run an app in Docker and disconnect the network with the `docker network disconnect ...` command. If the connection is currently handling `basicPublish`, it is very likely that this call gets stuck for a long time. No timeout configuration seems to help (SO_TIMEOUT, heartbeats, SO_KEEPALIVE, ...).

The thread is stuck with this stacktrace:

```
"DefaultDispatcher-worker-5" #315 daemon prio=5 os_prio=0 cpu=64.27ms elapsed=120.00s tid=0x00007fe9ecb2c650 nid=0x201 runnable  [0x00007fe9d74f6000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.Net.poll(java.base@17.0.5/Native Method)
        at sun.nio.ch.NioSocketImpl.park(java.base@17.0.5/NioSocketImpl.java:181)
        at sun.nio.ch.NioSocketImpl.park(java.base@17.0.5/NioSocketImpl.java:190)
        at sun.nio.ch.NioSocketImpl.implWrite(java.base@17.0.5/NioSocketImpl.java:415)
        at sun.nio.ch.NioSocketImpl.write(java.base@17.0.5/NioSocketImpl.java:440)
        at sun.nio.ch.NioSocketImpl$2.write(java.base@17.0.5/NioSocketImpl.java:826)
        at java.net.Socket$SocketOutputStream.write(java.base@17.0.5/Socket.java:1045)
        at java.io.BufferedOutputStream.flushBuffer(java.base@17.0.5/BufferedOutputStream.java:81)
        at java.io.BufferedOutputStream.flush(java.base@17.0.5/BufferedOutputStream.java:142)
        - locked <0x00000000c8b84988> (a java.io.BufferedOutputStream)
        at java.io.DataOutputStream.flush(java.base@17.0.5/DataOutputStream.java:128)
        at com.rabbitmq.client.impl.SocketFrameHandler.flush(SocketFrameHandler.java:197)
        at com.rabbitmq.client.impl.AMQConnection.flush(AMQConnection.java:636)
        at com.rabbitmq.client.impl.AMQCommand.transmit(AMQCommand.java:134)
        at com.rabbitmq.client.impl.AMQChannel.quiescingTransmit(AMQChannel.java:455)
        - locked <0x00000000c8b2b308> (a java.lang.Object)
        at com.rabbitmq.client.impl.AMQChannel.transmit(AMQChannel.java:428)
        - locked <0x00000000c8b2b308> (a java.lang.Object)
        at com.rabbitmq.client.impl.ChannelN.basicPublish(ChannelN.java:710)
        at com.rabbitmq.client.impl.ChannelN.basicPublish(ChannelN.java:685)
        at com.rabbitmq.client.impl.ChannelN.basicPublish(ChannelN.java:675)
        ...
   Locked ownable synchronizers:
        - <0x00000000c8b820b8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
```

We can see in the `netstat` output that the socket's send buffer remains occupied.

From an analysis of this library's source code and the `NioSocketImpl` sources, it is clear that the socket still appears to be in a "recoverable" state. The `flush` call is blocked, and `implWrite` is still optimistic about the possibility of writing more (just not yet).

Ideally, either `flush` would throw an exception (but that does not happen), or this library would detect heartbeat timeouts and close the connection from the outside.

If we try to implement this kind of behavior in the application itself, we fail. For example, if we time out the `basicPublish` call and then try to `close`/`abort` the connection, the close path always tries to write something to the socket, so it blocks as well.

For this reason, we believe this is a bug in the library itself, though a very subtle one that is hard to fix.


Replies: 5 comments 1 reply


This is a pretty esoteric situation. What would expedite our investigation is a script or some other means by which we can reproduce this easily. Ideally it would be as simple as `docker compose up`.


Thanks for the quick feedback. I'll try to prepare a simple simulation script.


This is how TCP works: it retries for a period of time before it declares the other end of the connection to be unresponsive.

Heartbeats (note that values lower than 5s are explicitly recommended against) and publisher confirm reception timeouts will help.

TCP parameter tuning on client hosts can help, too. This is mentioned somewhat in the Heartbeats guide.
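A minimal sketch of what the heartbeat and publisher-confirm suggestions could look like with amqp-client 5.x. The host, queue name, and timeout values below are illustrative placeholders, not recommendations, and this needs a running broker, so it is a configuration sketch rather than a runnable example:

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import java.util.concurrent.TimeUnit;

public class ConfirmedPublisher {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("rabbitmq-host");   // placeholder host
        // Heartbeat interval in seconds; values below 5s are recommended against.
        factory.setRequestedHeartbeat(10);

        try (Connection conn = factory.newConnection();
             Channel channel = conn.createChannel()) {
            channel.confirmSelect();        // enable publisher confirms
            channel.basicPublish("", "some-queue", null, "payload".getBytes());
            // Throw (so the caller can tear the connection down) if the broker
            // does not confirm within the timeout, instead of waiting forever.
            channel.waitForConfirmsOrDie(TimeUnit.SECONDS.toMillis(15));
        }
    }
}
```

Note that a confirm timeout detects an unresponsive broker at the protocol level, but the thread stuck inside a blocking socket write in `flush` is a separate problem, as described above.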

@sebek64

This is the small repro case: https://github.com/sebek64/repro

To run it, just do

```
./gradlew build
docker-compose -f docker-compose.yml up -d
```

Then you can observe the logs with `docker logs repro -f` and, at the same time, cut the network with `docker network disconnect repro_pluggable-network repro`. You can observe that a heartbeat exception is logged, but the connection is still alive. After reviving the connection with `docker network connect repro_pluggable-network repro`, the app finally crashes, but only after spending quite a lot of time in `basicPublish`.

The question of how natural this scenario could be is legitimate. For example, if we use an iptables DROP rule instead, the connection just fails correctly and quickly. I haven't tested unplugging a cable, but it could actually be similar to the network interface disappearing.

Anyway, I believe that the library should not rely on the assumption that the output stream's `flush` method can never block.


FWIW, we seem to be running into this exact problem regularly under high load (not sure if that is the trigger, though).
We run Amazon MQ for RabbitMQ, version 3.8.30. The publisher is a custom Debezium Server connector for RabbitMQ (amqp-client 5.16.0) running on EKS. The connector runs fine for several hours until it suddenly gets stuck and does not recover even after hours; only terminating the process helps.

This is the thread dump:

```
pool-7-thread-1 id=18 state=RUNNABLE (running in native)
    at java.base@17.0.3/sun.nio.ch.Net.poll(Native Method)
    at java.base@17.0.3/sun.nio.ch.NioSocketImpl.park(Unknown Source)
    at java.base@17.0.3/sun.nio.ch.NioSocketImpl.park(Unknown Source)
    at java.base@17.0.3/sun.nio.ch.NioSocketImpl.implWrite(Unknown Source)
    at java.base@17.0.3/sun.nio.ch.NioSocketImpl.write(Unknown Source)
    at java.base@17.0.3/sun.nio.ch.NioSocketImpl$2.write(Unknown Source)
    at java.base@17.0.3/java.net.Socket$SocketOutputStream.write(Unknown Source)
    at java.base@17.0.3/sun.security.ssl.SSLSocketOutputRecord.deliver(Unknown Source)
    at java.base@17.0.3/sun.security.ssl.SSLSocketImpl$AppOutputStream.write(Unknown Source)
    at java.base@17.0.3/java.io.BufferedOutputStream.flushBuffer(Unknown Source)
    at java.base@17.0.3/java.io.BufferedOutputStream.flush(Unknown Source)
    at java.base@17.0.3/java.io.DataOutputStream.flush(Unknown Source)
    at app//com.rabbitmq.client.impl.SocketFrameHandler.flush(SocketFrameHandler.java:197)
    at app//com.rabbitmq.client.impl.AMQConnection.flush(AMQConnection.java:636)
    at app//com.rabbitmq.client.impl.AMQCommand.transmit(AMQCommand.java:134)
    at app//com.rabbitmq.client.impl.AMQChannel.quiescingTransmit(AMQChannel.java:455)
      - locked java.lang.Object@3e9644ec
    at app//com.rabbitmq.client.impl.AMQChannel.transmit(AMQChannel.java:428)
      - locked java.lang.Object@3e9644ec
    at app//com.rabbitmq.client.impl.ChannelN.basicPublish(ChannelN.java:710)
    at app//com.rabbitmq.client.impl.ChannelN.basicPublish(ChannelN.java:685)
    at app//com.rabbitmq.client.impl.ChannelN.basicPublish(ChannelN.java:675)
    at app//com.rabbitmq.client.impl.recovery.AutorecoveringChannel.basicPublish(AutorecoveringChannel.java:207)
    at app//com.rabbitmq.client.RabbitMqChannelFactory_ProducerMethod_createAmqpChannel_8637ca800e0bf6e2ab56fd65a4ee28f4e7926dfa_ClientProxy.basicPublish(Unknown Source)
    at app//com.xxx.debezium.server.rabbitmq.RabbitMqChangeConsumer.publishEventToRabbitMq(RabbitMqChangeConsumer.java:105)
    at app//com.xxx.debezium.server.rabbitmq.RabbitMqChangeConsumer.publishEvent(RabbitMqChangeConsumer.java:91)
    at app//com.xxx.debezium.server.rabbitmq.RabbitMqChangeConsumer.handleChangeEvent(RabbitMqChangeConsumer.java:55)
    at app//com.xxx.debezium.server.rabbitmq.RabbitMqChangeConsumer.handleBatch(RabbitMqChangeConsumer.java:37)
    at app//io.debezium.embedded.ConvertingEngineBuilder.lambda$notifying$2(ConvertingEngineBuilder.java:86)
    at app//io.debezium.embedded.ConvertingEngineBuilder$$Lambda$222/0x0000000800e3fad0.handleBatch(Unknown Source)
    at app//io.debezium.embedded.EmbeddedEngine.run(EmbeddedEngine.java:913)
    at app//io.debezium.embedded.ConvertingEngineBuilder$2.run(ConvertingEngineBuilder.java:195)
    at app//io.debezium.server.DebeziumServer.lambda$start$1(DebeziumServer.java:161)
    at app//io.debezium.server.DebeziumServer$$Lambda$278/0x0000000800e51700.run(Unknown Source)
    at java.base@17.0.3/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base@17.0.3/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base@17.0.3/java.lang.Thread.run(Unknown Source)
```

Our current workaround is to wrap the publishing in an `ExecutorService#submit` call with a small timeout.
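For reference, that workaround pattern can be sketched self-contained, with the blocking publish simulated by a plain `Runnable` (the class and method names here are ours, not from the application above). The caveat is that cancelling the future does not actually unblock a thread stuck in a native socket write; interrupts are ignored there, so each timed-out publish can leak a worker thread:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PublishWithTimeout {
    // Daemon thread so a permanently stuck write does not block JVM exit.
    private static final ExecutorService PUBLISHER =
            Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "publisher");
                t.setDaemon(true);
                return t;
            });

    /** Runs the publish task; returns true on completion, false on timeout. */
    static boolean publishWithTimeout(Runnable publish, long timeoutMs) {
        Future<?> f = PUBLISHER.submit(publish);
        try {
            f.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Best effort only: a thread blocked in a socket write ignores
            // the interrupt, so the single worker thread may stay occupied.
            f.cancel(true);
            return false;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        boolean fast = publishWithTimeout(() -> {}, 1000);
        boolean stuck = publishWithTimeout(() -> {
            // Simulates a publish blocked in a socket write.
            try { Thread.sleep(60_000); } catch (InterruptedException ignored) {}
        }, 200);
        System.out.println("fast=" + fast + " stuck=" + stuck);  // prints "fast=true stuck=false"
    }
}
```

Because the executor has a single worker, a genuinely stuck write also blocks every subsequent publish submitted to it; detecting repeated timeouts and recreating the connection (and executor) is left to the application.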


Thanks for providing some steps to reproduce; I'll investigate more shortly. In the meantime, you can try to:


This discussion was converted from issue #994 on March 20, 2023 15:19.

