chore: Add retry to pipeline templates constructors to add retrier to each pipeline step #179


Open
ca-nguyen wants to merge 2 commits into aws:main from ca-nguyen:add-retry-to-pipeline-test

@ca-nguyen (Contributor) commented Nov 5, 2021 (edited)

Description

Fix build failures caused by SageMaker ThrottlingException when running pipeline integration tests.

Fixes #(issue) - N/A

Why is the change necessary?

Recent build failures were caused by SageMaker ThrottlingException (Rate exceeded) during the following tests:

Solution

Add an optional retry argument to the pipeline template constructors (InferencePipeline and TrainingPipeline) so that a retry strategy can be added to each pipeline step. The same retrier will be added to every step.

Caveat: This fix applies the retry strategy to all steps in the pipeline. The customer won't be able to customize the strategy for each step.
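A minimal sketch of the behaviour described above, using hypothetical stand-in classes. `Retry`, `Step`, and `apply_retry_to_all` here are illustrative only and are not the SDK's implementation. It shows why the caveat holds: one retrier object is attached to every step, so no step can receive a different policy.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Retry:
    """Hypothetical stand-in for a retrier; field names follow the PR's example."""
    error_equals: tuple
    interval_seconds: int = 1
    max_attempts: int = 3
    backoff_rate: float = 2.0


@dataclass
class Step:
    """Hypothetical stand-in for a pipeline step."""
    name: str
    retriers: List[Retry] = field(default_factory=list)

    def add_retry(self, retry: Retry) -> None:
        self.retriers.append(retry)


def apply_retry_to_all(steps: List[Step], retry: Optional[Retry]) -> None:
    """Attach the same retrier to every step; do nothing when retry is None."""
    if retry is None:
        return
    for step in steps:
        step.add_retry(retry)


# Step names borrowed from the description; the pipeline itself is assumed.
steps = [Step("training_step"), Step("model_step"),
         Step("endpoint_config_step"), Step("deploy_step")]
shared = Retry(error_equals=("SageMaker.AmazonSageMakerException",),
               interval_seconds=5, max_attempts=5, backoff_rate=2)
apply_retry_to_all(steps, shared)
```

Passing `None` leaves the steps untouched, which matches the argument being optional.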

Alternate solution 1:

We could let the client customize the retry strategy for each pipeline step by accepting a dict in addition to a Retry object.

Caveat: The retry strategy dict keys must correspond exactly to the step variable names. A validation step could be added to warn the customer about any unrecognized keys.

For example:

retry_strategy_per_step = {
    'training_step': <training_retry_strategy>,
    'model_step': <model_retry_strategy>,
    'endpoint_config_step': <endpoint_config_retry_strategy>,
    'deploy_step': <deploy_retry_strategy>
}

If a dict is received, only add retriers to the steps that have a strategy defined in that dict.
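A rough sketch of how this alternate could resolve its input, assuming the four step names from the example above. `KNOWN_STEPS` and `resolve_retriers` are hypothetical names, and retriers are treated as opaque objects. A dict applies strategies only to the named steps and warns on unrecognized keys; any other non-None value is treated as a single retrier for every step.

```python
import warnings
from typing import Any, Dict, List

# Step names taken from the example dict in the description.
KNOWN_STEPS = ("training_step", "model_step",
               "endpoint_config_step", "deploy_step")


def resolve_retriers(retry: Any) -> Dict[str, List[Any]]:
    """Map each known step name to the list of retriers it should receive."""
    resolved: Dict[str, List[Any]] = {name: [] for name in KNOWN_STEPS}
    if retry is None:
        return resolved
    if isinstance(retry, dict):
        for key, strategy in retry.items():
            if key not in resolved:
                # Validation step: warn instead of failing on a typo.
                warnings.warn(f"Ignoring retry strategy for unrecognized step: {key!r}")
                continue
            resolved[key].append(strategy)
        return resolved
    # A single retrier applies to every step.
    for name in KNOWN_STEPS:
        resolved[name].append(retry)
    return resolved
```

Whether an unrecognized key should warn or raise is a design choice; the description only suggests a warning.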

Alternate solution 2:

Only add retries to the integration tests by updating the pipeline workflow with the added retries.

# Once pipeline is created do something like:
sagemaker_retry_strategy = Retry(
    error_equals=["SageMaker.AmazonSageMakerException"],
    interval_seconds=5,
    max_attempts=5,
    backoff_rate=2
)
steps = pipeline.workflow.definition.branch.steps
for step in steps:
    step.add_retry(sagemaker_retry_strategy)
pipeline.workflow.update(Chain(steps))

Caveat: If the fix is only applied to the integration tests, customers who want to add retry strategies to the pipeline steps will have to do this each time they create a pipeline.
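For a sense of what the strategy in this snippet implies, the wait times can be worked out from the Amazon States Language semantics: the interval is the wait before the first retry, and each later wait is multiplied by the backoff rate. The helper `retry_delays` below is a hypothetical name, and the sketch ignores the run time of the step itself.

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Return the wait (in seconds) before each retry attempt."""
    return [interval_seconds * backoff_rate ** attempt
            for attempt in range(max_attempts)]


# Values from the Retry(...) call in alternate solution 2.
delays = retry_delays(interval_seconds=5, max_attempts=5, backoff_rate=2)
print(delays)       # [5, 10, 20, 40, 80]
print(sum(delays))  # 155
```

So a step that keeps getting throttled would wait roughly 155 seconds in total before the failure propagates.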

Testing

  • Updated integ test and added unit test
  • Generated doc locally

Pull Request Checklist

Please check all boxes (including N/A items)

Testing

  • Unit tests added
  • Integration test added
  • Manual testing - why was it necessary? could it be automated? - N/A

Documentation

  • docs: All relevant docs updated
  • docstrings: All public APIs documented

Title and description

  • Change type: Title is prefixed with change type: and follows conventional commits
  • References: Indicate issues fixed via: Fixes #xxx - N/A

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license.

@ca-nguyen marked this pull request as ready for review November 5, 2021 06:46

@wong-a (Contributor)

This is a feature, not a chore. Adding new functionality should be motivated from the customer's POV, not just to fix the tests. That being said, I see some value here. Ideally, we should've had retries added by default. But it's still possible to add retriers by updating the Chain directly, right? Do we have any open issues related to this?

It would be nice to add a preconfigured retry strategy like you defined in the tests. It's not uncommon for SDKs to have default and reusable retry strategies. Customers using the pipeline classes probably don't want to deal with much of the lower level ASL constructs.
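One shape such a preconfigured strategy could take is sketched below at the Amazon States Language level, where a Task state carries its retriers in a `Retry` array. The names `DEFAULT_SAGEMAKER_RETRIER` and `with_default_retry` are hypothetical, and the values are borrowed from alternate solution 2 in the description rather than from the tests.

```python
import copy
import json

# Hypothetical reusable default; values taken from the PR description.
DEFAULT_SAGEMAKER_RETRIER = {
    "ErrorEquals": ["SageMaker.AmazonSageMakerException"],
    "IntervalSeconds": 5,
    "MaxAttempts": 5,
    "BackoffRate": 2,
}


def with_default_retry(state: dict) -> dict:
    """Return a copy of an ASL state with the default retrier appended."""
    updated = copy.deepcopy(state)
    updated.setdefault("Retry", []).append(copy.deepcopy(DEFAULT_SAGEMAKER_RETRIER))
    return updated


task = {
    "Type": "Task",
    "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
    "End": True,
}
print(json.dumps(with_default_retry(task), indent=2))
```

Returning a copy keeps the original state definition unchanged, so the default can be reused across steps without sharing mutable lists.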

* (list[`sagemaker.amazon.amazon_estimator.RecordSet`]) - A list of `sagemaker.amazon.amazon_estimator.RecordSet` objects, where each instance is a different channel of training data.
s3_bucket (str): S3 bucket under which the output artifacts from the training job will be stored. The parent path used is built using the format: ``s3://{s3_bucket}/{pipeline_name}/models/{job_name}/``. In this format, `pipeline_name` refers to the keyword argument provided for TrainingPipeline. If a `pipeline_name` argument was not provided, one is auto-generated by the pipeline as `training-pipeline-<timestamp>`. Also, in the format, `job_name` refers to the job name provided when calling the :meth:`TrainingPipeline.run()` method.
client (SFN.Client, optional): boto3 client to use for creating and interacting with the inference pipeline in Step Functions. (default: None)
retry (Retry): A retrier that defines the each pipeline step's retry policy. See `Error handling in Step Functions <https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html#error-handling-retrying-after-an-error>`_ for more details. (default: None)

Suggested change
- retry (Retry): A retrier that defines the each pipeline step's retry policy. See `Error handling in Step Functions <https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html#error-handling-retrying-after-an-error>`_ for more details. (default: None)
+ retry (Retry): A retrier that defines the retry policy for each step in the pipeline. See `Error handling in Step Functions <https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html#error-handling-retrying-after-an-error>`_ for more details. (default: None)

Any reason to not make this a list for multiple retriers?
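If the argument were to accept either form, the constructor could normalize it up front. This is only a sketch of that idea: `normalize_retriers` is a hypothetical helper, and retriers are treated as opaque objects.

```python
from typing import Any, List


def normalize_retriers(retry: Any) -> List[Any]:
    """Accept None, a single retrier, or a list/tuple of retriers."""
    if retry is None:
        return []
    if isinstance(retry, (list, tuple)):
        return list(retry)
    return [retry]
```

Each step would then loop over the normalized list and add every retrier in order, which matters because Step Functions evaluates retriers in the order they are listed.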

@StepFunctions-Bot (Contributor)

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildProject6AEA49D1-sEHrOdk7acJc
  • Commit ID: aea996c
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository


Reviewers

  • shivlaks: awaiting requested review
  • wong-a: left review comments

At least 2 approving reviews are required to merge this pull request.
