- Notifications
You must be signed in to change notification settings - Fork2.1k
[Resilience] Integrate Polly for grain placement resilience and add configuration options#9819
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
base:main
Are you sure you want to change the base?
Uh oh!
There was an error while loading.Please reload this page.
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Pull request overview
This PR integrates the Polly resilience framework into Orleans' grain placement system to add retry and timeout capabilities for placement operations, making the system more robust against transient failures.
- Adds Polly-based resilience pipeline with configurable retry and timeout policies for grain placement
- Introduces three new configuration options in
SiloMessagingOptions:PlacementTimeout,PlacementMaxRetries, andPlacementRetryBaseDelay - Refactors placement logic to execute through a resilience pipeline with message expiration checks
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| Directory.Packages.props | Adds Polly (8.5.2) and Polly.Extensions (8.6.4) package references |
| src/Orleans.Core/Orleans.Core.csproj | Adds Polly package references to Orleans.Core |
| src/Orleans.Runtime/Orleans.Runtime.csproj | Adds Polly.Extensions package reference to Orleans.Runtime |
| src/Orleans.Runtime/OrleansRuntimeResiliencePolicies.cs | New file defining resilience policies with timeout and retry strategies for placement operations |
| src/Orleans.Runtime/Configuration/Options/SiloMessagingOptions.cs | Adds three new configuration properties for placement resilience with defaults |
| src/Orleans.Runtime/Hosting/DefaultSiloServices.cs | Registers Orleans runtime resilience policies during service configuration |
| src/Orleans.Runtime/Placement/PlacementService.cs | Integrates resilience pipeline into placement workflow with message expiration checks and context struct to avoid allocations |
| builder | ||
| .AddTimeout(newTimeoutStrategyOptions | ||
| { | ||
| Timeout=options.PlacementTimeout, | ||
| OnTimeout= args=> | ||
| { | ||
| logger.LogWarning("Grain placement operation timed out after {Timeout}.",options.PlacementTimeout); | ||
| returndefault; | ||
| } | ||
| }) | ||
| .AddRetry(newRetryStrategyOptions | ||
| { | ||
| MaxRetryAttempts=options.PlacementMaxRetries, | ||
| BackoffType=DelayBackoffType.Exponential, | ||
| Delay=options.PlacementRetryBaseDelay, | ||
| UseJitter=true, | ||
| ShouldHandle=newPredicateBuilder().Handle<Exception>(ex=>IsTransientPlacementException(ex)), | ||
| OnRetry= args=> | ||
| { | ||
| logger.LogDebug(args.Outcome.Exception,"Retrying grain placement operation. Attempt: {AttemptNumber}, Delay: {RetryDelay}.",args.AttemptNumber,args.RetryDelay); | ||
| returndefault; | ||
| } | ||
| }); |
CopilotAINov 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
The resilience pipeline order is incorrect. The timeout is added before the retry strategy, which means the timeout applies to the entire operation including all retries, not to individual retry attempts. This should be reversed: add retry first, then timeout, so each retry attempt has its own timeout.
builder.AddRetry(newRetryStrategyOptions{ ...}).AddTimeout(newTimeoutStrategyOptions{ ...});
With the current order, ifPlacementTimeout is 30 seconds and retries take 10 seconds each, the operation would fail after 30 seconds even if retries are still within the retry limit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
That's precisely what we want.
| exceptionswitch | ||
| { | ||
| OrleansException=>true, | ||
| TimeoutException=>true, |
CopilotAINov 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Retrying onTimeoutException creates a logical conflict. When aTimeoutException is thrown by Polly's timeout strategy, retrying it will likely just hit the same timeout again. This exception should either not be retried, or the timeout configuration needs to be reconsidered (see comment about timeout/retry ordering).
If the intent is to handle timeouts from operationswithin the placement logic (not from Polly's timeout strategy), those should be caught and wrapped in a different exception type to distinguish them from Polly's timeout.
| TimeoutException=>true, | |
| TimeoutException=>false, |
| if(firstMessage.IsExpired) | ||
| { | ||
| LogMessageExpiredDuringPlacement(_logger,firstMessage.TargetGrain); | ||
| thrownewTimeoutException($"Message expired before placement could complete for grain{firstMessage.TargetGrain}."); | ||
| } |
CopilotAINov 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
The message expiration check throws aTimeoutException, but the resilience policy is configured to retryTimeoutException (line 71 of OrleansRuntimeResiliencePolicies.cs). This means expired messages will be unnecessarily retried, which is wasteful since they will fail again immediately.
Consider using a different exception type for message expiration (e.g., a customMessageExpiredException orOperationCanceledException) that won't trigger retries, or exclude this specific case from retry logic.
| if(firstMessage.IsExpired) | ||
| { | ||
| LogMessageExpiredDuringPlacement(_logger,targetGrain); | ||
| thrownewTimeoutException($"Message expired before placement could complete for grain{targetGrain}."); | ||
| } |
CopilotAINov 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Same issue as the first expiration check: thisTimeoutException will be retried by the resilience policy even though the message has already expired. See comment on lines 381-385.
| usingSystem; | ||
| usingMicrosoft.Extensions.DependencyInjection; | ||
| usingMicrosoft.Extensions.Logging; | ||
| usingMicrosoft.Extensions.Options; | ||
| usingOrleans.Configuration; | ||
| usingOrleans.Runtime.Placement; | ||
| usingPolly; | ||
| usingPolly.Retry; | ||
| usingPolly.Timeout; | ||
| namespaceOrleans.Runtime; | ||
| /// <summary> | ||
| /// Resilience policies used by the Orleans runtime. | ||
| /// </summary> | ||
| publicstaticclassOrleansRuntimeResiliencePolicies | ||
| { | ||
| /// <summary> | ||
| /// The key used to identify the placement resilience pipeline in <see cref="Polly.Registry.ResiliencePipelineProvider{TKey}"/>. | ||
| /// </summary> | ||
| publicconststringPlacementResiliencePipelineKey="Orleans.Placement"; | ||
| /// <summary> | ||
| /// Adds all Orleans runtime resilience policies to the service collection. | ||
| /// </summary> | ||
| /// <param name="services">The service collection.</param> | ||
| /// <returns>The service collection for chaining.</returns> | ||
| internalstaticIServiceCollectionAddOrleansRuntimeResiliencePolicies(IServiceCollectionservices) | ||
| { | ||
| // Placement resilience pipeline | ||
| services.AddResiliencePipeline(PlacementResiliencePipelineKey,static(builder,context)=> | ||
| { | ||
| varoptions=context.ServiceProvider.GetRequiredService<IOptions<SiloMessagingOptions>>().Value; | ||
| varlogger=context.ServiceProvider.GetRequiredService<ILogger<PlacementService>>(); | ||
| builder | ||
| .AddTimeout(newTimeoutStrategyOptions | ||
| { | ||
| Timeout=options.PlacementTimeout, | ||
| OnTimeout= args=> | ||
| { | ||
| logger.LogWarning("Grain placement operation timed out after {Timeout}.",options.PlacementTimeout); | ||
| returndefault; | ||
| } | ||
| }) | ||
| .AddRetry(newRetryStrategyOptions | ||
| { | ||
| MaxRetryAttempts=options.PlacementMaxRetries, | ||
| BackoffType=DelayBackoffType.Exponential, | ||
| Delay=options.PlacementRetryBaseDelay, | ||
| UseJitter=true, | ||
| ShouldHandle=newPredicateBuilder().Handle<Exception>(ex=>IsTransientPlacementException(ex)), | ||
| OnRetry= args=> | ||
| { | ||
| logger.LogDebug(args.Outcome.Exception,"Retrying grain placement operation. Attempt: {AttemptNumber}, Delay: {RetryDelay}.",args.AttemptNumber,args.RetryDelay); | ||
| returndefault; | ||
| } | ||
| }); | ||
| }); | ||
| returnservices; | ||
| } | ||
| /// <summary> | ||
| /// Determines whether an exception is transient and should be retried during placement operations. | ||
| /// </summary> | ||
| privatestaticboolIsTransientPlacementException(Exceptionexception)=> | ||
| exceptionswitch | ||
| { | ||
| OrleansException=>true, | ||
| TimeoutException=>true, | ||
| OperationCanceledException=>false, | ||
| _=>false | ||
| }; | ||
| } |
CopilotAINov 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
The new resilience features for grain placement lack test coverage. Consider adding tests for:
- Retry behavior on transient failures
- Timeout behavior
- Interaction between timeout and retry policies
- Message expiration scenarios during retries
- Configuration options validation
This is especially important given the complexity of the retry/timeout interaction and the bugs identified in the resilience policy configuration.
| /// This timeout applies to the entire placement operation including retries. | ||
| /// If the timeout is exceeded, the placement operation will fail with a <see cref="TimeoutException"/>. |
CopilotAINov 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
The documentation states "This timeout applies to the entire placement operation including retries," which accurately describes the current implementation but may not be the intended behavior.
With the current pipeline order (timeout before retry), a single timeout covers all retry attempts. This means if retries take longer than the timeout duration, the operation will fail even if individual attempts would have succeeded.
If this is intentional, the documentation is correct. However, if each retry attempt should have its own timeout (which is more typical), this documentation should be updated to reflect that change when the pipeline order is corrected. See the comment on OrleansRuntimeResiliencePolicies.cs lines 36-58.
| /// This timeout applies tothe entire placement operationincluding retries. | |
| /// Ifthe timeout is exceeded, the placement operation will fail with a <see cref="TimeoutException"/>. | |
| /// This timeout applies toeach individual placementattempt. If a single attempt exceeds the timeout, it will fail and theoperationwill retry up to <see cref="PlacementMaxRetries"/> times. | |
| /// Ifall attempts fail due to timeout, the placement operation will fail with a <see cref="TimeoutException"/>. |
| <PackageVersionInclude="OpenTelemetry.Instrumentation.Http"Version="1.9.0" /> | ||
| <PackageVersionInclude="OpenTelemetry.Instrumentation.Runtime"Version="1.9.0" /> | ||
| <PackageVersionInclude="Polly"Version="8.5.2" /> | ||
| <PackageVersionInclude="Polly.Extensions"Version="8.6.4" /> |
CopilotAINov 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
The package version forPolly.Extensions is 8.6.4 whilePolly is at 8.5.2. This version mismatch could lead to compatibility issues. Both packages should use the same major.minor version to ensure they work correctly together.
| <PackageVersionInclude="Polly.Extensions"Version="8.6.4" /> | |
| <PackageVersionInclude="Polly.Extensions"Version="8.5.2" /> |
…onfiguration options
d6580ef tod5457b0Compare
Uh oh!
There was an error while loading.Please reload this page.
Microsoft Reviewers:Open in CodeFlow