Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Surface fetch/pull/push kill_after_timeout and reset default to None#1340

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
Byron merged 17 commits intogitpython-developers:mainfromsroet:update_fetch_timeout
Sep 18, 2021

Conversation

sroet
Copy link
Contributor

@sroetsroet commentedSep 10, 2021
edited
Loading

(hopefully)closes#1339
This PR does as ofe058c4c :

  • renamestimeout tokill_after_timeout and set default toNone instead of10 s
  • Surfaceskill_after_timeout to user facing functions (remote. pull/push/fetch)
  • refactorAutoInterrupt so we can._terminate() the underlying process and grab status codes from terminated processes
  • Add a class variableAutoInterrupt._status_code_if_terminate; if non-zero it overrides any status code from the process in_terminate. This can be used to prevent race conditions in tests,as described in this comment

Some history of the commits (while trying a different strategy):

aafb300 changes the default timeout to 60 s
1d26515 allowspush,pull andfetch keywordtimeout that will be propagated down
febd4fe alters some tests to use the old 10 s timeout forfetch andpull
4113d01 also alters a test call topull to add a timeout of 10s
c55a8e3 fixes the test suite
7df33f3 sets the default timeout toNone

if the current default is too lose, users can now give their own timeout

@sroetsroet changed the titleUpdate fetch timeout[WIP] Update fetch timeoutSep 10, 2021
@sroet
Copy link
ContributorAuthor

alright, the test suite runs locally and gives the following coverage:

---------- coverage: platform linux, python 3.7.11-final-0 -----------Name                                Stmts   Miss  Cover-------------------------------------------------------git/__init__.py                        42      8    81%git/cmd.py                            525     98    81%git/compat.py                          46     15    67%git/config.py                         439     48    89%git/db.py                              27      1    96%git/diff.py                           268     10    96%git/exc.py                             63      3    95%git/index/__init__.py                   2      0   100%git/index/base.py                     498     32    94%git/index/fun.py                      179      7    96%git/index/typ.py                       68      3    96%git/index/util.py                      50      5    90%git/objects/__init__.py                13      0   100%git/objects/base.py                    91     10    89%git/objects/blob.py                    17      0   100%git/objects/commit.py                 276     28    90%git/objects/fun.py                    101      5    95%git/objects/submodule/__init__.py       0      0   100%git/objects/submodule/base.py         583     46    92%git/objects/submodule/root.py         137     12    91%git/objects/submodule/util.py          51      9    82%git/objects/tag.py                     47     12    74%git/objects/tree.py                   181     73    60%git/objects/util.py                   232     32    86%git/refs/__init__.py                    6      0   100%git/refs/head.py                       98      6    94%git/refs/log.py                       150     10    93%git/refs/reference.py                  55      3    95%git/refs/remote.py                     32      2    94%git/refs/symbolic.py                  338     18    95%git/refs/tag.py                        47      6    87%git/remote.py                         427     55    87%git/repo/__init__.py                    1      0   100%git/repo/base.py                      552     36    93%git/repo/fun.py                       201     20    90%git/types.py                           44     10    77%git/util.py                           539    121    78%-------------------------------------------------------TOTAL                                6426    744    88%Results (64.31s):     378 passed      42 skipped

@sroetsroet changed the title[WIP] Update fetch timeoutUpdate fetch timeoutSep 10, 2021
@sroetsroet changed the titleUpdate fetch timeoutSurface fetch/pull/push timeout and set default to 60 sSep 10, 2021
@ByronByron requested a review fromYobmodSeptember 10, 2021 14:32
@ByronByron added this to thev3.1.24 - Bugfixes milestoneSep 10, 2021
@Byron
Copy link
Member

Thanks a lot for the initiative. With the offending release yanked I believe we win some time to wait for@Yobmod to chime in on this.

My preference, if I understand what's happening correctly, is to set the default timeout toNone and not alter the previous behaviour, as the code as is can't tell if it's just a long-running process or a stalled one. Thus I would prefer offering this up to callers only.

@Yobmod
Copy link
Contributor

Yobmod commentedSep 10, 2021
edited
Loading

Hi, I think setting the timeout default to None works better, rather than a larger number.
Thread.join(None) would then wait as long as needed (I think?), but if people want a timeout, it's still there to use.

Or it could be removed altogether.
It got added when the tests started hanging in a 3.10 beta. I couldn't find the problem, but it doesn't happen with release candidates any more.

@dwhswenson
Copy link

FWIW, the one concern I have about surfacing the argument here is that I think there might be confusion betweencmd.execute_kwargs['kill_after_timeout'] and thistimeout. As a user, without a careful read of the docs I would expect the behavior ofremote.fetch(timeout=60) to be that of'kill_after_timeout'.

@sroetsroet changed the titleSurface fetch/pull/push timeout and set default to 60 sSurface fetch/pull/push timeout and reset default to NoneSep 10, 2021
@sroet
Copy link
ContributorAuthor

@Yobmod,@dwhswenson ; I have reverted the default to None, please let me know what you think.

@Byron, I will not be able to push to this branch for the next ~60 hour, so feel free to push to it if you want this to be finished up before monday

@Byron
Copy link
Member

@sroet Thanks for your help with this and the incredibly quick turnaround time, and for everyone to share their thoughts to make this better.

FWIW, the one concern I have about surfacing the argument here is that I think there might be confusion betweencmd.execute_kwargs['kill_after_timeout'] and thistimeout. As a user, without a careful read of the docs I would expect the behavior ofremote.fetch(timeout=60) to be that of'kill_after_timeout'.

To me it seems thatkill_after_timeout could have been used to handle this all along, as killing the process the threads are reading from should break their pipes and thus stop the read-loops, allowing them to stop naturally. Since we have tests for this maybe you could test this hypothesis. Indeed it would be great to avoid two timeout parameters who have the same intention, but only one of them seems to be solving the issue at its root.

The actual process terminationhappens here and it looks like it will succeed on non-windows but might still fail to kill stalled processes on windows. To my mind, that's acceptable and I'd rather wait for users to report issues due to this than to keep thetimeout parameter around just for the case the git process isn't responding to signals.

@sroet
Copy link
ContributorAuthor

This took a bit more work than expected, but should now work with the keywordkill_after_timeout for the user facing functions

The actual process termination happens here and it looks like it will succeed on non-windows but might still fail to kill stalled processes on windows. To my mind, that's acceptable and I'd rather wait for users to report issues due to this than to keep the timeout parameter around just for the case the git process isn't responding to signals.

That did not really work out, as most user facing functions (fetch,pull, andpush) setas_process=True so that code path is never hit, insteadthis if statement is hit which returns anAutoInterrupt object. Looking atthat class it actually knows how to kill underlying processes (in the__del__ method) so I decided to factor that code out to a private function_terminate (and fixingwait ifself.proc=None).

To me it seems that kill_after_timeout could have been used to handle this all along, as killing the process the threads are reading from should break their pipes and thus stop the read-loops, allowing them to stop naturally.

The reading threads had no way of handling the closing of the streams, so added a check to the exception.

It should now raises like this:

>>>a=git.Repo('.')>>>a.remotes.origin.fetch(kill_after_timeout=0.01)Traceback (mostrecentcalllast):File"<stdin>",line1,in<module>File"/home/sander/github_files/GitPython/git/remote.py",line868,infetchkill_after_timeout=kill_after_timeout)File"/home/sander/github_files/GitPython/git/remote.py",line732,in_get_fetch_info_from_stderrproc.wait(stderr=stderr_text)File"/home/sander/github_files/GitPython/git/cmd.py",line490,inwaitraiseGitCommandError(remove_password_if_present(self.args),status,errstr)git.exc.GitCommandError:Cmd('git')faileddueto:exitcode(-15)cmdline:gitfetch-voriginstderr:'error: process killed because it timed out. kill_after_timeout=0.01 seconds'

When writing the tests, I saw that the error handling code ofpush does something different than forfetch andpull

  • fetch/pull always error out if there is any error lines instd_err(it does not care if there is output as that is piped toNone)
  • push only errors out if there are error lines instd_err andno output tostd_out

@Byron Is this mismatch intentional/wanted?

I will update the initial post to indicate the current status

@sroetsroet changed the titleSurface fetch/pull/push timeout and reset default to NoneSurface fetch/pull/push kill_after_timeout and reset default to NoneSep 13, 2021
@Byron
Copy link
Member

Thanks a million for all the effort you are putting in!

@Byron Is this mismatch intentional/wanted?

It's based on the observation on how these processes usually work and communicate errors. However, if it feels out-of-place or wrong there is no reason not to improve it while you are here. It wouldn't have been the first time that there are long-standing issues in the code-base that somehow never surfaced. While I encourage you for these fixes I also warn that the test-suite might not catch all the subtle behaviour either, so it's probably better to leave as much unchanged as possible while fixing the issue at hand.

Note that CI is failing because of the linter and white-space related errors. It's strange that it asks me to approve running CI on this PR over and over again and I hope it will just run when the fix is pushed.

@dwhswenson
Copy link

It's strange that it asks me to approve running CI on this PR over and over again and I hope it will just run when the fix is pushed.

Seehttps://github.blog/2021-04-22-github-actions-update-helping-maintainers-combat-bad-actors/, in particular:

Once a collaborator with write access has reviewed the changes, they can approve the workflows to run for the current commit only. If the pull request is updated to a new commit, a new approval will be required.

IMO, that last is a bad choice by GitHub. I understand the desire to avoid abuse by crypto-miners, but there should be an option for the approval to be for the duration of the PR.

@Byron
Copy link
Member

Thanks for the heads-up, I wasn't aware of this change.

[…] but there should be an option for the approval to be for the duration of the PR.

Couldn't agree more. Here it seems they are over-steering towards abuse-avoidance and trade convenience of using their platform. Instead, if legitimate PRs turn out to become abusive later, it's something I'd happily report - after it happened.
Maybe additional options will be implemented some day, I do have hope.

@sroet
Copy link
ContributorAuthor

@bryon, sorry, I should have checked mypy and flake8 localy. they now return:

(gitpy) ***@***:~/github_files/GitPython$ flake80(gitpy) ***@***:~/github_files/GitPython$ mypy -p gitSuccess: no issues found in 37 source files

@sroet
Copy link
ContributorAuthor

Looking at the test failure for9678226, it seems like the CI environment is too quickfetching after doing apull to reliably triggering the case, only doing a pull should still hit most codepaths...

@sroet
Copy link
ContributorAuthor

Hmm, it started failing locally with timeout=0 , so reset it to a small value. Hopefully this should be small enough to at least let CI fail out, otherwise I would like a reviewer to see if the test also fails on their local system...

@sroet
Copy link
ContributorAuthor

sroet commentedSep 14, 2021
edited
Loading

A bit of info on why the error is not raised but still shows up in the log as:

WARNING  git.remote:remote.py:734 Error lines received while fetching: error: process killed because it timed out. kill_after_timeout=0 seconds

This is because there is a race condition: in the time it takes between the main thread triggering the timeout (which gives the log) and telling the process to terminate (which should trigger the error by setting the status to not 0), the process can complete correctly. In this case, I decided of letting the code recover instead of erroring out.

@Byron
Copy link
Member

It's interesting as CI still fails even though the timeout is small. Can there be a 'too-small' due to same race elsewhere?

Maybe the CI is onto something here and indicates some fixable issue, and for now I'd be inclined to believe that.

@sroet
Copy link
ContributorAuthor

sroet commentedSep 14, 2021
edited
Loading

@Byron the issue lies in the fact that the CI python thread is too slow compared to the git process, I can force CI behavior locally by altering the lines:

fortinthreads:t.join(timeout=kill_after_timeout)ift.is_alive():ifisinstance(process,Git.AutoInterrupt):process._terminate()

to

fortinthreads:t.join(timeout=kill_after_timeout)ift.is_alive():ifisinstance(process,Git.AutoInterrupt):#TODO debug codeimporttimetime.sleep(1)process._terminate()

What happens is that inside theAutoInterrupt._terminate() function the following lines exist:

# did the process finish already so we have a return code ?try:ifproc.poll()isnotNone:self.status=proc.poll()returnNone

which allows the process status to be0 even if the timeout is triggered initially.

and even later it still does:

            # try to kill it                                                                try:                                                                                proc.terminate()                                                                self.status = proc.wait()    # ensure process goes away

Which also allows the process to terminate gracefully if the kill signal is not fast enough.

Now, I can force setself.status to-15 inside the_terminate function which make sure we always trigger the error whenever_terminate() has been called, even if the process completed between python triggering a timeout and getting around to actually kill the process.

@Byron is that (not allowing the process to recover) the preferred solution here?

[EDIT]: addition; The GitCommandError is only raised (insidewait()) ifAutoInterrupt.status != 0

@Byron
Copy link
Member

@Byron is that (not allowing the process to recover) the preferred solution here?

A robust implementation is preferred any time, and it seems like a system that can recover to states that are changing is the better option.

Ultimately CI should work and ideally work for the next few years without spurious failures due to races.

@sroet
Copy link
ContributorAuthor

Ultimately CI should work and ideally work for the next few years without spurious failures due to races.

Added a class variable toAutoInterrupt that sets the statuscode inside_terminate if it non-zero, and updated the tests to use it. It makes the test pass locally, with thesleep(1). So I am pretty confident it catches the race condition in tests and still allows for the robust implementation, please have another look

@ByronByron self-requested a reviewSeptember 15, 2021 23:41
Copy link
Member

@ByronByron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Fantastic, it looks like this PR is ready to be merged. I'd love to have another reviewer though, maybe@Yobmod could take a final look?

I'd be happy to try a new release when merged, too.

@ByronByron merged commitaa50766 intogitpython-developers:mainSep 18, 2021
@per1234
Copy link

Maybe additional options will be implemented some day, I do have hope.

@Byron they already have been:
https://github.blog/changelog/2021-07-01-github-actions-new-settings-for-maintainers/
Switching to the "Require approval for first-time contributors who are new to GitHub" setting has almost completely eliminated the need to approve workflows in the repositories I maintain.

dwhswenson reacted with thumbs up emojiByron reacted with heart emoji

@Byron
Copy link
Member

Screenshot 2021-09-20 at 20 30 23

Thank you sooo much! That drove me crazy 🤦‍♂️

@sroetsroet deleted the update_fetch_timeout branchSeptember 20, 2021 12:31
@sroet
Copy link
ContributorAuthor

sroet commentedSep 20, 2021
edited
Loading

Thank you sooo much! That drove me crazy

Can imagine, thanks for dealing with the CI on this one!

Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers

@ByronByronByron approved these changes

@YobmodYobmodAwaiting requested review from Yobmod

Assignees
No one assigned
Labels
None yet
Development

Successfully merging this pull request may close these issues.

default thread timeout too short for CI
5 participants
@sroet@Byron@Yobmod@dwhswenson@per1234

[8]ページ先頭

©2009-2025 Movatter.jp