basicPublish can freeze for a very long time on network interface removal #995

sebek64 started this conversation in General

  • RabbitMQ version: 3.9.21
  • Erlang version: 12.3.2.2
  • Client library version: 5.16.0
  • Operating system, version, and patch level: Linux, kernel 5.10.0
  • Java: openjdk version "17.0.5" 2022-10-18 LTS

The RabbitMQ client can freeze while writing to the socket when the network interface is removed. For example, we can run an app in Docker and disconnect the network with the `docker network disconnect ...` command. If the connection is currently handling `basicPublish`, it is very likely that this call gets stuck for a long time. No timeout configuration seems to help (SO_TIMEOUT, heartbeats, SO_KEEPALIVE, ...).

The thread is stuck with this stacktrace:

```
"DefaultDispatcher-worker-5" #315 daemon prio=5 os_prio=0 cpu=64.27ms elapsed=120.00s tid=0x00007fe9ecb2c650 nid=0x201 runnable  [0x00007fe9d74f6000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.Net.poll(java.base@17.0.5/Native Method)
        at sun.nio.ch.NioSocketImpl.park(java.base@17.0.5/NioSocketImpl.java:181)
        at sun.nio.ch.NioSocketImpl.park(java.base@17.0.5/NioSocketImpl.java:190)
        at sun.nio.ch.NioSocketImpl.implWrite(java.base@17.0.5/NioSocketImpl.java:415)
        at sun.nio.ch.NioSocketImpl.write(java.base@17.0.5/NioSocketImpl.java:440)
        at sun.nio.ch.NioSocketImpl$2.write(java.base@17.0.5/NioSocketImpl.java:826)
        at java.net.Socket$SocketOutputStream.write(java.base@17.0.5/Socket.java:1045)
        at java.io.BufferedOutputStream.flushBuffer(java.base@17.0.5/BufferedOutputStream.java:81)
        at java.io.BufferedOutputStream.flush(java.base@17.0.5/BufferedOutputStream.java:142)
        - locked <0x00000000c8b84988> (a java.io.BufferedOutputStream)
        at java.io.DataOutputStream.flush(java.base@17.0.5/DataOutputStream.java:128)
        at com.rabbitmq.client.impl.SocketFrameHandler.flush(SocketFrameHandler.java:197)
        at com.rabbitmq.client.impl.AMQConnection.flush(AMQConnection.java:636)
        at com.rabbitmq.client.impl.AMQCommand.transmit(AMQCommand.java:134)
        at com.rabbitmq.client.impl.AMQChannel.quiescingTransmit(AMQChannel.java:455)
        - locked <0x00000000c8b2b308> (a java.lang.Object)
        at com.rabbitmq.client.impl.AMQChannel.transmit(AMQChannel.java:428)
        - locked <0x00000000c8b2b308> (a java.lang.Object)
        at com.rabbitmq.client.impl.ChannelN.basicPublish(ChannelN.java:710)
        at com.rabbitmq.client.impl.ChannelN.basicPublish(ChannelN.java:685)
        at com.rabbitmq.client.impl.ChannelN.basicPublish(ChannelN.java:675)
        ...
   Locked ownable synchronizers:
        - <0x00000000c8b820b8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
```

We can see in the `netstat` output that the socket's send buffer remains occupied.

From an analysis of this library's source code and the `NioSocketImpl` sources, it is clear that the socket still appears to be in a "recoverable" state. The `flush` call is blocked, and `implWrite` is still optimistic about the possibility of writing more (just not yet).

Ideally, either `flush` would throw an exception (but that does not happen), or this library would detect heartbeat timeouts and close the connection from the outside.

If we try to implement this kind of behavior in the application itself, we fail. For example, if we time out the `basicPublish` call and then try to `close`/`abort` the connection, the close path always tries to write something to the socket, so it blocks as well.

For this reason, we believe this is a bug in the library itself, though a very subtle one that is hard to fix.


Replies: 5 comments 1 reply


This is a pretty esoteric situation. What would expedite our investigation is a script or some other means by which we can reproduce this easily. Ideally it would be as simple as `docker compose up`.


Thanks for the quick feedback. I'll try to prepare a simple simulation script.


This is how TCP works: it retries for a period of time before it declares the other end of the connection to be unresponsive.

Heartbeats (note that values lower than 5s are explicitly recommended against) and publisher confirm reception timeouts will help.

TCP parameter tuning on client hosts can help, too. This is mentioned somewhat in the Heartbeats guide.
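A minimal sketch of what the heartbeat and publisher-confirm suggestions could look like with amqp-client 5.x. The host, queue name, and timeout values below are illustrative placeholders, not recommendations, and this needs a running broker, so it is a configuration sketch rather than a runnable example:

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import java.util.concurrent.TimeUnit;

public class ConfirmedPublisher {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("rabbitmq-host");   // placeholder host
        // Heartbeat interval in seconds; values below 5s are recommended against.
        factory.setRequestedHeartbeat(10);

        try (Connection conn = factory.newConnection();
             Channel channel = conn.createChannel()) {
            channel.confirmSelect();        // enable publisher confirms
            channel.basicPublish("", "some-queue", null, "payload".getBytes());
            // Throw (so the caller can tear the connection down) if the broker
            // does not confirm within the timeout, instead of waiting forever.
            channel.waitForConfirmsOrDie(TimeUnit.SECONDS.toMillis(15));
        }
    }
}
```

Note that a confirm timeout detects an unresponsive broker at the protocol level, but the thread stuck inside a blocking socket write in `flush` is a separate problem, as described above.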

@sebek64

This is the small repro case: https://github.com/sebek64/repro

To run it, just do

```
./gradlew build
docker-compose -f docker-compose.yml up -d
```

Then you can observe the logs with `docker logs repro -f` and, at the same time, cut the network with `docker network disconnect repro_pluggable-network repro`. You can observe that a heartbeat exception is logged, but the connection is still alive. After reviving the connection with `docker network connect repro_pluggable-network repro`, the app finally crashes, but only after spending quite a lot of time in `basicPublish`.

The question of how natural this scenario could be is legitimate. For example, if we use an iptables DROP rule instead, the connection just fails correctly and quickly. I haven't tested unplugging a cable, but it could actually be similar to the network interface disappearing.

Anyway, I believe that the library should not rely on the assumption that the output stream's `flush` method can never block.


FWIW, we seem to be running into this exact problem regularly under high load (not sure if that is the trigger, though).
We run Amazon MQ for RabbitMQ, version 3.8.30. The publisher is a custom Debezium Server connector for RabbitMQ (amqp-client 5.16.0) running on EKS. The connector runs fine for several hours until it suddenly gets stuck and does not recover even after hours; only terminating the process helps.

This is the thread dump:

```
pool-7-thread-1 id=18 state=RUNNABLE (running in native)
    at java.base@17.0.3/sun.nio.ch.Net.poll(Native Method)
    at java.base@17.0.3/sun.nio.ch.NioSocketImpl.park(Unknown Source)
    at java.base@17.0.3/sun.nio.ch.NioSocketImpl.park(Unknown Source)
    at java.base@17.0.3/sun.nio.ch.NioSocketImpl.implWrite(Unknown Source)
    at java.base@17.0.3/sun.nio.ch.NioSocketImpl.write(Unknown Source)
    at java.base@17.0.3/sun.nio.ch.NioSocketImpl$2.write(Unknown Source)
    at java.base@17.0.3/java.net.Socket$SocketOutputStream.write(Unknown Source)
    at java.base@17.0.3/sun.security.ssl.SSLSocketOutputRecord.deliver(Unknown Source)
    at java.base@17.0.3/sun.security.ssl.SSLSocketImpl$AppOutputStream.write(Unknown Source)
    at java.base@17.0.3/java.io.BufferedOutputStream.flushBuffer(Unknown Source)
    at java.base@17.0.3/java.io.BufferedOutputStream.flush(Unknown Source)
    at java.base@17.0.3/java.io.DataOutputStream.flush(Unknown Source)
    at app//com.rabbitmq.client.impl.SocketFrameHandler.flush(SocketFrameHandler.java:197)
    at app//com.rabbitmq.client.impl.AMQConnection.flush(AMQConnection.java:636)
    at app//com.rabbitmq.client.impl.AMQCommand.transmit(AMQCommand.java:134)
    at app//com.rabbitmq.client.impl.AMQChannel.quiescingTransmit(AMQChannel.java:455)
      - locked java.lang.Object@3e9644ec
    at app//com.rabbitmq.client.impl.AMQChannel.transmit(AMQChannel.java:428)
      - locked java.lang.Object@3e9644ec
    at app//com.rabbitmq.client.impl.ChannelN.basicPublish(ChannelN.java:710)
    at app//com.rabbitmq.client.impl.ChannelN.basicPublish(ChannelN.java:685)
    at app//com.rabbitmq.client.impl.ChannelN.basicPublish(ChannelN.java:675)
    at app//com.rabbitmq.client.impl.recovery.AutorecoveringChannel.basicPublish(AutorecoveringChannel.java:207)
    at app//com.rabbitmq.client.RabbitMqChannelFactory_ProducerMethod_createAmqpChannel_8637ca800e0bf6e2ab56fd65a4ee28f4e7926dfa_ClientProxy.basicPublish(Unknown Source)
    at app//com.xxx.debezium.server.rabbitmq.RabbitMqChangeConsumer.publishEventToRabbitMq(RabbitMqChangeConsumer.java:105)
    at app//com.xxx.debezium.server.rabbitmq.RabbitMqChangeConsumer.publishEvent(RabbitMqChangeConsumer.java:91)
    at app//com.xxx.debezium.server.rabbitmq.RabbitMqChangeConsumer.handleChangeEvent(RabbitMqChangeConsumer.java:55)
    at app//com.xxx.debezium.server.rabbitmq.RabbitMqChangeConsumer.handleBatch(RabbitMqChangeConsumer.java:37)
    at app//io.debezium.embedded.ConvertingEngineBuilder.lambda$notifying$2(ConvertingEngineBuilder.java:86)
    at app//io.debezium.embedded.ConvertingEngineBuilder$$Lambda$222/0x0000000800e3fad0.handleBatch(Unknown Source)
    at app//io.debezium.embedded.EmbeddedEngine.run(EmbeddedEngine.java:913)
    at app//io.debezium.embedded.ConvertingEngineBuilder$2.run(ConvertingEngineBuilder.java:195)
    at app//io.debezium.server.DebeziumServer.lambda$start$1(DebeziumServer.java:161)
    at app//io.debezium.server.DebeziumServer$$Lambda$278/0x0000000800e51700.run(Unknown Source)
    at java.base@17.0.3/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base@17.0.3/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base@17.0.3/java.lang.Thread.run(Unknown Source)
```

Our current workaround is to wrap the publishing in an `ExecutorService#submit` call with a small timeout.
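For reference, that workaround pattern can be sketched self-contained, with the blocking publish simulated by a plain `Runnable` (the class and method names here are ours, not from the application above). The caveat is that cancelling the future does not actually unblock a thread stuck in a native socket write; interrupts are ignored there, so each timed-out publish can leak a worker thread:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PublishWithTimeout {
    // Daemon thread so a permanently stuck write does not block JVM exit.
    private static final ExecutorService PUBLISHER =
            Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "publisher");
                t.setDaemon(true);
                return t;
            });

    /** Runs the publish task; returns true on completion, false on timeout. */
    static boolean publishWithTimeout(Runnable publish, long timeoutMs) {
        Future<?> f = PUBLISHER.submit(publish);
        try {
            f.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Best effort only: a thread blocked in a socket write ignores
            // the interrupt, so the single worker thread may stay occupied.
            f.cancel(true);
            return false;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        boolean fast = publishWithTimeout(() -> {}, 1000);
        boolean stuck = publishWithTimeout(() -> {
            // Simulates a publish blocked in a socket write.
            try { Thread.sleep(60_000); } catch (InterruptedException ignored) {}
        }, 200);
        System.out.println("fast=" + fast + " stuck=" + stuck);  // prints "fast=true stuck=false"
    }
}
```

Because the executor has a single worker, a genuinely stuck write also blocks every subsequent publish submitted to it; detecting repeated timeouts and recreating the connection (and executor) is left to the application.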


Thanks for providing some steps to reproduce; I'll investigate more shortly. In the meantime, you can try to:


This discussion was converted from issue #994 on March 20, 2023 15:19.

