Improve timeout and retry mechanic of exporters #4183
Conversation
LarsMichelsen commented Sep 11, 2024
I noticed the public symbols test being red (https://github.com/open-telemetry/opentelemetry-python/actions/runs/10760943700/job/29900823921). I understand its purpose, but I guess some guidance is needed on which names are acceptable as new public names and which are not. Any advice? Regarding the changelog: is it intended that I add the changelog entry?
emdneto commented Sep 12, 2024
In that case we need a changelog for this one.
This is the first change in a chain of commits to rework the retry mechanic. It is based on the work of open-telemetry#3764 and basically tries to land the changes proposed by that monolithic commit step by step.

The plan is roughly to proceed in these steps:
* Extract retry mechanic from GRPC exporters
* Consolidate HTTP with GRPC exporter retry implementation
* Pipe timeout through RetryingExporter
* Make exporter lock protect the whole export instead of just a single iteration
* Make timeout float instead of int
* Add back-off with jitter

It's pretty likely that the plan will change along the way.
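To make the target of this series a bit more concrete, here is a hypothetical skeleton of such a shared helper. The class name `RetryingExporter` is taken from the later commits, but the signatures, attribute names, and defaults are assumptions for illustration only; the retry loop itself is sketched under the later commits.

```python
# Hypothetical skeleton only; signatures are assumptions, not the final API.
import threading
from typing import Callable, Optional


class RetryingExporter:
    """Wraps a protocol-specific export function and owns the retry loop."""

    def __init__(
        self,
        export_function: Callable[..., object],
        timeout_s: float,
    ) -> None:
        self._export_function = export_function
        self._timeout_s = timeout_s
        self._shutdown = threading.Event()
        self._export_lock = threading.Lock()

    def export_with_retry(self, payload, timeout_s: Optional[float] = None):
        ...  # retry loop with deadline and jittered back-off, sketched below

    def shutdown(self, timeout_millis: float = 30_000) -> None:
        self._shutdown.set()
```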
This unifies the implementation of the OTLP exporters and the HTTP log exporter. Next step is to consolidate the remaining HTTP exporters. Fixes open-telemetry#4043.
This unifies the implementation of the OTLP exporters and the HTTP metric exporter.
This unifies the implementation of the OTLP exporters and the HTTP trace exporter.
Propagate the given timeout value from all exporters to the RetryingExporter, which will hand it over to the individual `_export` function call. Currently the same unaltered value is handed over. In the following commits the RetryingExporter will update the timeout handed over to `_export` depending on the remaining time.
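A small sketch of that wiring, building on the hypothetical skeleton above; the exporter class name, the `_export` signature, and the constructor arguments are illustrative assumptions, not the actual API.

```python
# Sketch: the protocol-specific exporter hands its configured timeout to the
# shared RetryingExporter, which (for now) passes the unaltered value on to
# every _export attempt. Names and signatures are assumptions.
class OTLPSpanExporterSketch:
    def __init__(self, timeout_s: float = 10.0) -> None:
        self._retrying_exporter = RetryingExporter(
            export_function=self._export, timeout_s=timeout_s
        )

    def _export(self, serialized_data, timeout_s: float):
        ...  # perform the actual gRPC/HTTP request within timeout_s

    def export(self, spans):
        return self._retrying_exporter.export_with_retry(spans)
```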
TL;DR: the time assert in the test was wrong.

Looking at the original commit (c84ba9) of this test, the delay of the mocked backoff was fixed at 4 seconds with 6 iterations, so the time until shutdown returned was around 24 seconds. The default timeout of the lock wait in shutdown did not hit before the retry of the exporter was over. The line `self.assertGreaterEqual(now, (start_time + 30 / 1000))` seemed to be wrong from the beginning. It just ensured that the shutdown call took at least 30 ms. I assume it was supposed to do something else, but to me it's not clear what exactly it should do.

One explanation would be that it was intended to be `self.assertGreaterEqual(now, (start_time + 30 * 1000))` instead. But that would mean: shutdown should need at least 30 seconds to pass, which does not make sense imho.

Another option would be that it was intended to be `self.assertLessEqual(now, (start_time + 30 * 1000))` instead. That would mean: shutdown should need at most 30 seconds to pass, which actually could make more sense. It would ensure that the call of `self.exporter.shutdown` returns within 30 seconds.

My biggest issue here is that it tests the happy path where the retry is over before the timeout, while there is no mechanism to ensure that the exporter thread is stopped in a sensible time in case its current exponential backoff is already in a longer sleep while the shutdown flag is set. If I understand it correctly, it can hit the beginning of a 32-second sleep, which could mean that a process needs 30 seconds to signal the shutdown and an additional 32 seconds to finish the exporter thread. But that's something for another change.

After all these words, looking at the current HEAD, the situation of the test case is basically the same as originally. The retry interval has been changed recently to be way shorter. The default lock timeout stayed the same at 30 seconds, but that does not matter too much since the retry mechanism is finished long before that. And the aforementioned assert is still wrong, with an additional problem from open-telemetry#4014.
Previously the export lock was released by the export between retries, which gave a shutdown the opportunity to interrupt the export depending on which thread was first to get the lock again. This behavior is not desired as it heavily relies on timing and makes things less deterministic.

The retry loop still checks whether the shutdown flag is set and will break in case it was told to do so.
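A method-body sketch of the described locking; the attribute names (`_export_lock`, `_shutdown_requested`, `_export_function`, `_result`) and the fixed back-off values are illustrative assumptions. At this point in the series the shutdown signal is still a plain boolean flag.

```python
import time


def export_with_retry(self, payload):
    # The lock is now held for the whole retry loop rather than per attempt,
    # so shutdown can no longer "win" the lock between two retries.
    with self._export_lock:
        for delay_s in (1, 2, 4, 8):
            if self._shutdown_requested:  # the flag is still honored between attempts
                break
            result = self._export_function(payload)
            if result is self._result.SUCCESS:
                return result
            time.sleep(delay_s)
    return self._result.FAILURE
```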
A few tests are still commented out and will be enabled later, once the implementation is updated to implement the desired behavior.

The tests are, like other existing tests, timing sensitive, which makes them prone to fail on slow systems. It would certainly be better to get rid of this issue. I haven't found another approach in the existing test cases, so I stuck to the current pattern.
Replace the RetryableExporter shutdown flag and sleep with a `threading.Event` to make the waiting time for the next retry interruptible.

This enables us to change the semantics of the shutdown timeout to work in a less surprising way. Previously the timeout worked roughly like this:
1. Wait for the current export iteration to finish
2. Try to get the lock faster than the export thread
3. If the lock was acquired, the shutdown flag was set
4. If the lock was not acquired, the shutdown flag was set
5. The export thread would have picked this flag up after finishing the current retry cycle.

This led to hanging applications on shutdown, giving the applications no way to influence the shutdown behavior. Users could not decide how much time should be spent trying to prevent the loss of data.

This change allows shutdown to interrupt the current export waiting call so that the exporter can be shut down with a user-provided deadline.
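A minimal sketch of the change, assuming the attribute is called `_shutdown` (the name and the surrounding class are illustrative):

```python
import threading


class _RetryWaitSketch:
    def __init__(self) -> None:
        # Previously a plain boolean flag plus time.sleep(); the sleep could
        # not be interrupted, so shutdown had to wait out the current back-off.
        self._shutdown = threading.Event()

    def _wait_for_retry(self, delay_s: float) -> bool:
        # Event.wait() returns True as soon as shutdown() sets the event,
        # so a pending shutdown wakes the export thread immediately.
        return self._shutdown.wait(timeout=delay_s)

    def shutdown(self, timeout_millis: float = 30_000) -> None:
        self._shutdown.set()
```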
Derive a deadline from the initial timeout for each export_with_retry call and try to stick to this time frame as closely as possible. Each call to export_function will get the remaining time as the timeout parameter and is expected to return quickly in case of an issue, so that export_with_retry can take care of waiting and retrying.

This also removes the surprising deadline triggered by the max_value of the exponential backoff.
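A method-body sketch of the deadline handling, with the export function receiving the remaining time on every attempt; names, the back-off progression, and the result handling are assumptions.

```python
import time


def export_with_retry(self, payload, timeout_s: float):
    # One deadline for the whole call; each attempt only gets the time left.
    deadline = time.monotonic() + timeout_s
    delay_s = 1.0
    while True:
        remaining_s = deadline - time.monotonic()
        if remaining_s <= 0:
            return self._result.FAILURE
        result = self._export_function(payload, timeout_s=remaining_s)
        if result is self._result.SUCCESS:
            return result
        # Never sleep past the deadline, regardless of the back-off value.
        if self._shutdown.wait(timeout=min(delay_s, remaining_s)):
            return self._result.FAILURE
        delay_s *= 2
```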
This change allows the user to call retrying exports with individual timeouts. This argument will be needed in the next step, where the GRPC metric exporter's batch processing is changed to respect the timeout for the whole export.
The timeout passed as an argument to the export method now overrides the default value set during initialization, if given. Additionally, ensure the batching in OTLPMetricExporter.export respects the export timeout.
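A hedged sketch of how the metric export batching could respect the per-call timeout; `_split_metrics_data`, `_timeout_s`, and `_exporter` are assumptions about the exporter's internals, not the actual implementation.

```python
import time
from typing import Optional

from opentelemetry.sdk.metrics.export import MetricExportResult


def export(self, metrics_data, timeout_millis: Optional[float] = None, **kwargs):
    # A per-call timeout overrides the value configured at initialization.
    timeout_s = timeout_millis / 1e3 if timeout_millis is not None else self._timeout_s
    deadline = time.monotonic() + timeout_s
    for batch in self._split_metrics_data(metrics_data):  # hypothetical helper
        remaining_s = deadline - time.monotonic()
        if remaining_s <= 0:
            return MetricExportResult.FAILURE
        result = self._exporter.export_with_retry(batch, timeout_s=remaining_s)
        if result is not MetricExportResult.SUCCESS:
            return result
    return MetricExportResult.SUCCESS
```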
The timeout passed as an argument to the export method now overrides the default value set during initialization, if given.

This change is made to align the API of all exporters. In the metric exporter this parameter already existed but was not used. That was changed with the previous commit (2102e3b25b52).
It's desirable to have a common API across all exporters. Since the previously existing timeout_millis argument wasn't used anyway, it should not be an issue to change the default value here. The None case will fall back to the object-level timeout, which is the same as the previous default argument value.
defd3b8 to 62a9101

LarsMichelsen commented Sep 14, 2024
To my understanding the following new public symbols are necessary:

If I understood CONTRIBUTING.md correctly, I cannot do anything further about it and now some maintainer / code owner needs to judge it.
emdneto commented Nov 17, 2025
Closing since #4564 was merged
Description
I took a deep look into #3764 and tried to separate the individual changes. While I think the approach of that PR is good in general, I changed some aspects of it. For example, I don't agree with the change picking the lowest of the given timeouts (see here).
Even with the change split into multiple steps, it's still a big undertaking, so a reviewer will have to take some time to get into it. Nevertheless, I hope that someone will be found who feels up to it.
Fixes #4043
Type of change
How Has This Been Tested?
Several tests were added (retry logic, shutdown logic, new timeout export argument) and, of course, executed as well :)
Does This PR Require a Contrib Repo Change?
Checklist: