Retry attempts that fail due to a connection timeout #24


Merged
susodapop merged 23 commits into main from retry-connection-attempt-failures on Aug 5, 2022

Conversation

@susodapop (Contributor) commented on Jul 28, 2022 (edited):

Note for reviewers

On 4 August 2022 I reverted all changes in this PR so I could reimplement and apply all your review feedback. This happened in fa1fd50. Every subsequent commit encapsulates one logical change to the code. Working through them one at a time should be quite easy.

  1. Isolate delay bounding (bf65b81)
  2. Move error details scope up one level (a0d340e) (H/T @moderakh)
  3. Retry GetOperationStatus for OSError (a55cf9d) (thanks @sander-goos and @benfleis)
  4. Log when we do it (38411a8) (H/T @benfleis)

The final four commits are simple code cleanup plus a fix for one test-suite warning unrelated to this change.

Since our e2e tests are not yet enabled in GitHub Actions, I ran them locally and they all passed.

Description

Currently, we only retry attempts that returned a code 429 or 503 and include a Retry-After header. This pull request also allows GetOperationStatus requests to be retried if the request fails with an OSError exception. The configured retry policy is still honoured with regard to maximum attempts and maximum retry duration.
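
To illustrate how the policy bounds interact with a proposed delay, here is a minimal sketch; the helper name bound_retry_delay and its parameter names are illustrative assumptions, not the connector's actual API:

    # Hypothetical sketch only: clamps a proposed retry delay into the policy's window.
    def bound_retry_delay(proposed_delay: float,
                          delay_min: float = 1.0,
                          delay_max: float = 60.0) -> float:
        """Clamp a delay (e.g. from a Retry-After header or a default)
        between the minimum and maximum configured by the retry policy."""
        return max(delay_min, min(proposed_delay, delay_max))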

Background

The reason we only retry 429 and 503 responses today is that retries must be idempotent. Otherwise the connector could cause unexpected or harmful consequences (data loss, excess resource utilisation, etc.).

We know that 429/503 responses are safe to retry because the attempt was halted before the server could execute it, regardless of whether the attempted operation was itself idempotent.

We also know that GetOperationStatus requests are idempotent because they do not modify data on the server. Therefore we can add an extra case to our retry allow list (a minimal sketch follows this list):

  • the request is a GetOperationStatus command, and
  • it failed due to an OSError such as TimeoutError or ConnectionResetError.
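
A minimal sketch of that check, assuming a hypothetical is_retry_safe helper and a plain method-name string (the real connector wires this decision into its Thrift transport differently):

    # Hypothetical sketch only, not the connector's actual retry logic.
    def is_retry_safe(method_name: str, error: BaseException) -> bool:
        """Only GetOperationStatus requests that failed with an OSError
        (e.g. TimeoutError, ConnectionResetError) are safe to retry."""
        return method_name == "GetOperationStatus" and isinstance(error, OSError)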

Previously we attempted this same behaviour by retrying GetOperationStatus requests regardless of the nature of the exception. But that change could not pass our e2e tests because there are valid cases where GetOperationStatus will raise an exception from within our own library code: for example, if an operation is cancelled in a separate thread, GetOperationStatus will raise a "DOES NOT EXIST" exception.

Logging Behaviour

The connector will log whenever it retries an attempt because of an OSError. It uses log level INFO if the OSError is one we consider normal, and log level WARNING if the OSError seems unusual. The codes we consider normal are:

    info_errs = [                # | Debian | Darwin |
                                 # |--------|--------|
        errno.ESHUTDOWN,         # |   32   |   32   |
        errno.EAFNOSUPPORT,      # |   97   |   47   |
        errno.ECONNRESET,        # |   104  |   54   |
        errno.ETIMEDOUT,         # |   110  |   60   |
    ]

The full set of OSError codes is platform-specific. I wrote this patch to target a specific customer scenario where GetOperationStatus requests were retried after an operating-system socket timeout exception. In that customer scenario the error code was Errno 110: Connection timed out. However, that error code is specific to Linux. On a Darwin/macOS host the code would be 60 and on Windows it would be 10060.

Rather than catch these codes specifically, I use Python's errno built-in to check for errno.ETIMEDOUT, which resolves to the platform-specific code at runtime. I tested this manually on Linux and macOS (but not on Windows). @benfleis helped me pick what we consider "normal".

We log all other OSError codes at WARNING because it would be pretty unusual for a request to fail due to a "FileNotFound" error or a system fault.
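
A minimal sketch of that log-level selection, assuming a hypothetical log_retry_of_oserror helper (the connector's actual call sites differ):

    import errno
    import logging

    logger = logging.getLogger(__name__)

    # These errno members resolve to the platform-specific integer codes at runtime.
    INFO_ERRS = [errno.ESHUTDOWN, errno.EAFNOSUPPORT, errno.ECONNRESET, errno.ETIMEDOUT]

    # Hypothetical sketch only, not the connector's actual logging code.
    def log_retry_of_oserror(error: OSError) -> None:
        """Log a retried GetOperationStatus attempt at INFO for 'normal'
        OS errors and at WARNING for anything unexpected."""
        message = f"Retrying GetOperationStatus after OSError (errno {error.errno}): {error}"
        if error.errno in INFO_ERRS:
            logger.info(message)
        else:
            logger.warning(message)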

References

I found this article extremely helpful while formulating this fix. The author is solving a very similar problem across platforms.

Commits pushed to this branch (all signed off by Jesse Whitehouse <jesse@whitehouse.dev>):

  • It doesn't like retry policy bounds == None.
  • Other OSErrors like `EINTR` could indicate a call was interrupted after it was received by the server, which would potentially not be idempotent.
  • Only make it non-null for retryable requests.
  • every time. Whereas the previous approach passed in ten seconds.
@sander-goos (Contributor) left a comment:

Left some comments. Could you please link in the description the resources that assure that the errors we catch are safe to retry (and platform-independent)?

    )

    with self.assertRaises(OperationalError) as cm:
Contributor:
Can be more specific to assert RequestError

Author (susodapop):

I actually removed the e2e tests for this behaviour because they were no more useful than unit tests. I'll add an e2e test for this scenario after we merge support for HTTP proxies; then we can simulate real timeouts.

Author (susodapop):

The unit tests assert on RequestError, btw.

More commits pushed to this branch:

  • Add retry_delay_default to use in this case.
  • Emit warnings for unexpected OSError codes.
  • DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead.
@susodapop (Author):

@benfleis @sander-goos I updated the PR description to reflect the present state after this week's reviews / updates.

@benfleis (Contributor) left a comment:

Looks good!

Which real connection tests did you perform to validate? Related: perhaps worth adding a note to the PR (or near the retry code) explaining any manual E2E tests you used to validate behavior in "real conditions", so you (and others) won't have to think so hard about what's worth doing as a one-off manual test.

@susodapop (Author):

Which real connection tests did you perform to validate?

I used mitmweb (more here) to intercept traffic between the connector and a SQL Warehouse. mitmweb allows me to force a timeout and abort a connection, both of which raise OSError in Python. The expected behaviour was met: GetOperationStatus requests were retried within the retry policy, but other requests such as ExecuteStatement, FetchResults, OpenSession, and CloseSession were not.

This process is very manual, so I'm not yet incorporating it into the tests/e2e/driver_tests.py file. I will follow up with an outline of how to automate this testing in the future.

How I ran the tests

The actual integration test I ran looked like this:

    def test_make_request_will_retry_GetOperationStatus(self):

        import thrift, errno
        from databricks.sql.thrift_api.TCLIService.TCLIService import Client
        from databricks.sql.exc import RequestError
        from databricks.sql.utils import NoRetryReason
        from databricks.sql import client
        from databricks.sql.thrift_api.TCLIService import ttypes

        with self.cursor() as cursor:
            cursor.execute("SELECT 1")
            op_handle = cursor.active_op_handle

        req = ttypes.TGetOperationStatusReq(
            operationHandle=op_handle,
            getProgressUpdate=False,
        )

        EXPECTED_RETRIES = 2

        with self.cursor({"_socket_timeout": 10, "_retry_stop_after_attempts_count": 2}) as cursor:
            _client = cursor.connection.thrift_backend._client
            with self.assertRaises(RequestError) as cm:
                breakpoint()
                # At this point I instructed mitmweb to intercept and suspend all requests
                cursor.connection.thrift_backend.make_request(_client.GetOperationStatus, req)

            self.assertEqual(
                NoRetryReason.OUT_OF_ATTEMPTS.value,
                cm.exception.context["no-retry-reason"],
            )
            self.assertEqual(
                f"{EXPECTED_RETRIES}/{EXPECTED_RETRIES}",
                cm.exception.context["attempt"],
            )

I used this Dockerfile to make this work on my local machine

    FROM python:3.7-slim-buster

    RUN useradd --create-home pysql

    # Ubuntu packages
    RUN apt-get update && \
      apt-get install -y --no-install-recommends \
        curl \
        gnupg \
        build-essential \
        pwgen \
        libffi-dev \
        sudo \
        git-core \
        # Additional packages required for data sources:
        libssl-dev \
        libsasl2-modules-gssapi-mit && \
        apt-get clean && \
        rm -rf /var/lib/apt/lists/*

    RUN sudo mkdir /usr/local/share/ca-certificates/extra
    RUN curl mitm.it/cert/pem > mitmcert.pem
    RUN openssl x509 -in mitmcert.pem -inform PEM -out mitm.crt
    RUN sudo cp mitm.crt /usr/local/share/ca-certificates/extra/mitm.crt
    RUN sudo update-ca-certificates

    RUN pip install poetry --user

    COPY --chown=pysql . /pysql
    RUN chown pysql /pysql
    WORKDIR /pysql
    RUN python -m poetry install

And I configured Docker using its config.json file to forward all traffic from my containers through the mitmweb proxy installed on my local environment, like this:

"proxies":    {"default":      {"httpProxy":"http://<my local network ip address>:8080","httpsProxy":"http://<my local network ip address>:8080","noProxy":"pypi.org"      }    }

The mitmweb proxy must be running when building the Docker container.

Related: perhaps worth adding a note to the PR (or near the retry code) explaining any manual E2E tests you used to validate behavior in "real conditions", so you (and others) won't have to think so hard about what's worth doing as a one-off manual test.

Done here: 10016ea

More commits pushed to this branch:

  • This reverts commit 4db4ad0.
susodapop merged commit c59c393 into main on Aug 5, 2022
susodapop deleted the retry-connection-attempt-failures branch on August 5, 2022 at 21:23
@sander-goos (Contributor):

Thanks for the fix and the detailed explanation!

moderakh pushed a commit to moderakh/databricks-sql-python that referenced this pull request on Aug 24, 2022:

  * Isolate delay bounding logic
  * Move error details scope up one level.
  * Retry GetOperationStatus if an OSError was raised during execution. Add retry_delay_default to use in this case.
  * Log when a request is retried due to an OSError. Emit warnings for unexpected OSError codes.
  * Update docstring for make_request.
  * Nit: unit tests show the .warn message is deprecated. DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead.

Signed-off-by: Jesse Whitehouse <jesse@whitehouse.dev>

saishreeeee pushed two commits that referenced this pull request on Jun 4, 2025, both with the same commit message.

Reviewers

  @moderakh left review comments
  @yunbodeng-db approved these changes
  Awaiting requested review from @arikfr
  Awaiting requested review from @sander-goos
  @benfleis approved these changes


6 participants: @susodapop, @sander-goos, @benfleis, @moderakh, @yunbodeng-db
