Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

[Resilience] Integrate Polly for grain placement resilience and add configuration options#9819

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Open
ReubenBond wants to merge1 commit intodotnet:main
base:main
Choose a base branch
Loading
fromReubenBond:fix/connection-resilience/1

Conversation

@ReubenBond
Copy link
Member

@ReubenBondReubenBond commentedNov 25, 2025
edited by dotnet-policy-servicebot
Loading

Microsoft Reviewers:Open in CodeFlow

CopilotAI review requested due to automatic review settingsNovember 25, 2025 22:48
Copilot finished reviewing on behalf ofReubenBondNovember 25, 2025 22:51
Copy link
Contributor

CopilotAI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Pull request overview

This PR integrates the Polly resilience framework into Orleans' grain placement system to add retry and timeout capabilities for placement operations, making the system more robust against transient failures.

  • Adds Polly-based resilience pipeline with configurable retry and timeout policies for grain placement
  • Introduces three new configuration options inSiloMessagingOptions:PlacementTimeout,PlacementMaxRetries, andPlacementRetryBaseDelay
  • Refactors placement logic to execute through a resilience pipeline with message expiration checks

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

Show a summary per file
FileDescription
Directory.Packages.propsAdds Polly (8.5.2) and Polly.Extensions (8.6.4) package references
src/Orleans.Core/Orleans.Core.csprojAdds Polly package references to Orleans.Core
src/Orleans.Runtime/Orleans.Runtime.csprojAdds Polly.Extensions package reference to Orleans.Runtime
src/Orleans.Runtime/OrleansRuntimeResiliencePolicies.csNew file defining resilience policies with timeout and retry strategies for placement operations
src/Orleans.Runtime/Configuration/Options/SiloMessagingOptions.csAdds three new configuration properties for placement resilience with defaults
src/Orleans.Runtime/Hosting/DefaultSiloServices.csRegisters Orleans runtime resilience policies during service configuration
src/Orleans.Runtime/Placement/PlacementService.csIntegrates resilience pipeline into placement workflow with message expiration checks and context struct to avoid allocations

Comment on lines +36 to +58
builder
.AddTimeout(newTimeoutStrategyOptions
{
Timeout=options.PlacementTimeout,
OnTimeout= args=>
{
logger.LogWarning("Grain placement operation timed out after {Timeout}.",options.PlacementTimeout);
returndefault;
}
})
.AddRetry(newRetryStrategyOptions
{
MaxRetryAttempts=options.PlacementMaxRetries,
BackoffType=DelayBackoffType.Exponential,
Delay=options.PlacementRetryBaseDelay,
UseJitter=true,
ShouldHandle=newPredicateBuilder().Handle<Exception>(ex=>IsTransientPlacementException(ex)),
OnRetry= args=>
{
logger.LogDebug(args.Outcome.Exception,"Retrying grain placement operation. Attempt: {AttemptNumber}, Delay: {RetryDelay}.",args.AttemptNumber,args.RetryDelay);
returndefault;
}
});

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The resilience pipeline order is incorrect. The timeout is added before the retry strategy, which means the timeout applies to the entire operation including all retries, not to individual retry attempts. This should be reversed: add retry first, then timeout, so each retry attempt has its own timeout.

builder.AddRetry(newRetryStrategyOptions{ ...}).AddTimeout(newTimeoutStrategyOptions{ ...});

With the current order, ifPlacementTimeout is 30 seconds and retries take 10 seconds each, the operation would fail after 30 seconds even if retries are still within the retry limit.

Copilot uses AI. Check for mistakes.
Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

That's precisely what we want.

exceptionswitch
{
OrleansException=>true,
TimeoutException=>true,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Retrying onTimeoutException creates a logical conflict. When aTimeoutException is thrown by Polly's timeout strategy, retrying it will likely just hit the same timeout again. This exception should either not be retried, or the timeout configuration needs to be reconsidered (see comment about timeout/retry ordering).

If the intent is to handle timeouts from operationswithin the placement logic (not from Polly's timeout strategy), those should be caught and wrapped in a different exception type to distinguish them from Polly's timeout.

Suggested change
TimeoutException=>true,
TimeoutException=>false,

Copilot uses AI. Check for mistakes.
Comment on lines +381 to +385
if(firstMessage.IsExpired)
{
LogMessageExpiredDuringPlacement(_logger,firstMessage.TargetGrain);
thrownewTimeoutException($"Message expired before placement could complete for grain{firstMessage.TargetGrain}.");
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The message expiration check throws aTimeoutException, but the resilience policy is configured to retryTimeoutException (line 71 of OrleansRuntimeResiliencePolicies.cs). This means expired messages will be unnecessarily retried, which is wasteful since they will fail again immediately.

Consider using a different exception type for message expiration (e.g., a customMessageExpiredException orOperationCanceledException) that won't trigger retries, or exclude this specific case from retry logic.

Copilot uses AI. Check for mistakes.
Comment on lines +408 to +412
if(firstMessage.IsExpired)
{
LogMessageExpiredDuringPlacement(_logger,targetGrain);
thrownewTimeoutException($"Message expired before placement could complete for grain{targetGrain}.");
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Same issue as the first expiration check: thisTimeoutException will be retried by the resilience policy even though the message has already expired. See comment on lines 381-385.

Copilot uses AI. Check for mistakes.
Comment on lines +1 to +75
usingSystem;
usingMicrosoft.Extensions.DependencyInjection;
usingMicrosoft.Extensions.Logging;
usingMicrosoft.Extensions.Options;
usingOrleans.Configuration;
usingOrleans.Runtime.Placement;
usingPolly;
usingPolly.Retry;
usingPolly.Timeout;

namespaceOrleans.Runtime;

/// <summary>
/// Resilience policies used by the Orleans runtime.
/// </summary>
publicstaticclassOrleansRuntimeResiliencePolicies
{
/// <summary>
/// The key used to identify the placement resilience pipeline in <see cref="Polly.Registry.ResiliencePipelineProvider{TKey}"/>.
/// </summary>
publicconststringPlacementResiliencePipelineKey="Orleans.Placement";

/// <summary>
/// Adds all Orleans runtime resilience policies to the service collection.
/// </summary>
/// <param name="services">The service collection.</param>
/// <returns>The service collection for chaining.</returns>
internalstaticIServiceCollectionAddOrleansRuntimeResiliencePolicies(IServiceCollectionservices)
{
// Placement resilience pipeline
services.AddResiliencePipeline(PlacementResiliencePipelineKey,static(builder,context)=>
{
varoptions=context.ServiceProvider.GetRequiredService<IOptions<SiloMessagingOptions>>().Value;
varlogger=context.ServiceProvider.GetRequiredService<ILogger<PlacementService>>();

builder
.AddTimeout(newTimeoutStrategyOptions
{
Timeout=options.PlacementTimeout,
OnTimeout= args=>
{
logger.LogWarning("Grain placement operation timed out after {Timeout}.",options.PlacementTimeout);
returndefault;
}
})
.AddRetry(newRetryStrategyOptions
{
MaxRetryAttempts=options.PlacementMaxRetries,
BackoffType=DelayBackoffType.Exponential,
Delay=options.PlacementRetryBaseDelay,
UseJitter=true,
ShouldHandle=newPredicateBuilder().Handle<Exception>(ex=>IsTransientPlacementException(ex)),
OnRetry= args=>
{
logger.LogDebug(args.Outcome.Exception,"Retrying grain placement operation. Attempt: {AttemptNumber}, Delay: {RetryDelay}.",args.AttemptNumber,args.RetryDelay);
returndefault;
}
});
});

returnservices;
}

/// <summary>
/// Determines whether an exception is transient and should be retried during placement operations.
/// </summary>
privatestaticboolIsTransientPlacementException(Exceptionexception)=>
exceptionswitch
{
OrleansException=>true,
TimeoutException=>true,
OperationCanceledException=>false,
_=>false
};
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The new resilience features for grain placement lack test coverage. Consider adding tests for:

  • Retry behavior on transient failures
  • Timeout behavior
  • Interaction between timeout and retry policies
  • Message expiration scenarios during retries
  • Configuration options validation

This is especially important given the complexity of the retry/timeout interaction and the bugs identified in the resilience policy configuration.

Copilot uses AI. Check for mistakes.
Comment on lines +180 to +181
/// This timeout applies to the entire placement operation including retries.
/// If the timeout is exceeded, the placement operation will fail with a <see cref="TimeoutException"/>.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The documentation states "This timeout applies to the entire placement operation including retries," which accurately describes the current implementation but may not be the intended behavior.

With the current pipeline order (timeout before retry), a single timeout covers all retry attempts. This means if retries take longer than the timeout duration, the operation will fail even if individual attempts would have succeeded.

If this is intentional, the documentation is correct. However, if each retry attempt should have its own timeout (which is more typical), this documentation should be updated to reflect that change when the pipeline order is corrected. See the comment on OrleansRuntimeResiliencePolicies.cs lines 36-58.

Suggested change
/// This timeout applies tothe entire placement operationincluding retries.
/// Ifthe timeout is exceeded, the placement operation will fail with a <see cref="TimeoutException"/>.
/// This timeout applies toeach individual placementattempt. If a single attempt exceeds the timeout, it will fail and theoperationwill retry up to <see cref="PlacementMaxRetries"/> times.
/// Ifall attempts fail due to timeout, the placement operation will fail with a <see cref="TimeoutException"/>.

Copilot uses AI. Check for mistakes.
<PackageVersionInclude="OpenTelemetry.Instrumentation.Http"Version="1.9.0" />
<PackageVersionInclude="OpenTelemetry.Instrumentation.Runtime"Version="1.9.0" />
<PackageVersionInclude="Polly"Version="8.5.2" />
<PackageVersionInclude="Polly.Extensions"Version="8.6.4" />

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The package version forPolly.Extensions is 8.6.4 whilePolly is at 8.5.2. This version mismatch could lead to compatibility issues. Both packages should use the same major.minor version to ensure they work correctly together.

Suggested change
<PackageVersionInclude="Polly.Extensions"Version="8.6.4" />
<PackageVersionInclude="Polly.Extensions"Version="8.5.2" />

Copilot uses AI. Check for mistakes.
@ReubenBondReubenBondforce-pushed thefix/connection-resilience/1 branch fromd6580ef tod5457b0CompareNovember 26, 2025 21:26
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment

Reviewers

Copilot code reviewCopilotCopilot left review comments

Assignees

No one assigned

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

1 participant

@ReubenBond

[8]ページ先頭

©2009-2025 Movatter.jp