Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork937
Description
Background
CallingGit.execute
—whether directly, or indirectly by calling the dynamic attributes of aGit
instance—and passingkill_after_timeout
with a non-None
value, creates a timer on a separate thread that calls the localkill_process
function. This callback function usesos.kill
to kill the process. Before killing the process, it enumerates the process'sdirect children. If sending the first signal succeeds (basically, if the parent process still existed), it also attempts to kill the child processes.
The children are enumerated withps --ppid
:
Lines 1010 to 1014 infe082ad
p=Popen( | |
["ps","--ppid",str(pid)], | |
stdout=PIPE, | |
creationflags=PROC_CREATIONFLAGS, | |
) |
The problem
The--ppid
option isnot POSIX. Most GNU/Linux systems haveprocps, whoseps
implementation supports--ppid
. I am unsure ifany other implementations ofps
support it. The procps tools generally run only on Linux-based systems, because they use the/proc
filesystem (and assume it is laid out as in Linux). Although they can run on any such system, Alpine Linux and some minimal GNU/Linux environments do not ship them, defaulting tops
frombusybox instead.
As demonstrated below,macOSps
does not support--ppid
. Nor doFreeBSD,NetBSD,OpenBSD, orDragonFly.AIX does not have--ppid
.illumos does not have--ppid
; nor doesSolaris, though-ppid
(with one-
) can be used in 11.4.27 or higher. Although Cygwin mimics Linux where feasible, its/proc
filesystem is different, and itsps
does not support--ppid
either (nor even some important POSIX options like-o
).
The callback parses stdout from thatps
command, but does not examine the exit status or stderr. The effect is that an error message on a system without procps (or anotherps
supporting--ppid
, if there is one) is printed, and the parent process is still sentSIGKILL
, but its children are never found or sent signals.
As detailed below, although thefetch
,pull
, andpush
methods of theRemote
class accept akill_after_timeout
argument, they do not useGit.execute
, so they are unaffected by this bug.
Steps to reproduce
On macOS 13 (on a GitHub Actions CI runner withtmate), I created this script in a directory in$PATH
, named itgit-sleep
, and marked it executable:
#!/bin/shsleep"$@"
Then I calledsleep
on aGit
instance with akill_after_timeout
argument specifying a shorter duration than the sleep:
bash-3.2$ pythonPython 3.12.0 (v3.12.0:0fb18b02c8, Oct 2 2023, 09:45:56) [Clang 13.0.0 (clang-1300.0.29.30)] on darwinType "help", "copyright", "credits" or "license" for more information.>>> from git import Git>>> Git().sleep(10, kill_after_timeout=5)ps: illegal option -- -usage: ps [-AaCcEefhjlMmrSTvwXx] [-O fmt | -o fmt] [-G gid[,gid...]] [-g grp[,grp...]] [-u [uid,uid...]] [-p pid[,pid...]] [-t tty[,tty...]] [-U user[,user...]] ps [-L]Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/runner/work/GitPython/GitPython/git/cmd.py", line 741, in <lambda> return lambda *args, **kwargs: self._call_process(name, *args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/runner/work/GitPython/GitPython/git/cmd.py", line 1320, in _call_process return self.execute(call, **exec_kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/runner/work/GitPython/GitPython/git/cmd.py", line 1117, in execute raise GitCommandError(redacted_command, status, stderr_value, stdout_value)git.exc.GitCommandError: Cmd('git') failed due to: exit code(-9) cmdline: git sleep 10 stderr: 'Timeout: the command "git sleep 10" did not complete in 5 secs.'>>>
Impact
1. Theotherkill_after_timeout
is unaffected
There are two callables defined ingit/cmd.py
that accept an optionalkill_after_timeout
argument: the "internal" top-levelhandle_process_output
function that is not listed in__all__
but is used throughout GitPython, and the publicGit.execute
method (also used when dynamicGit
methods are called). The meaning of this argument is subtly different, and the associated implementations completely different.
This bug affects only the one in theGit
class. Thus it does not affect common uses of timeouts in interacting with remotes: theRemote.fetch
,Remote.push
, andRemote.pull
methods acceptkill_after_timeout
arguments, but they forward them tohandle_process_output
.
2. But this oneshould work on all Unix-like systems
From context, I think it is unintended not to support common Unix-like systems such as macOS. TheGit.execute
docstring says "This feature is not supported on Windows" and makes no other claims about compatibility, from which I think readers will reasonably infer that other platforms are believed supported. When called on a native Windows system (not Cygwin) with a non-None
value forkill_after_timeout
, it raises aGitCommandError
. Other systems, including Cygwin, raise no exception and register thekill_process
callback.kill_after_timeout
is thusin effect documented to work on all systems except native Windows.
3. What happens if the child processes aren't sentSIGKILL
?
I don't know how much of a problem it is forSIGKILL
to be sent only to the parent and not to its direct children. I am not confident I know why that is being done, as opposed to killing only the parent process, or attempting to kill its entire process tree. Myguess is that this is because manygit
commands use a subprocess to do their work. If so, then it may in practice be important—in situations where people passkill_after_timeout
—that the child processes are killed as well.
However,git
subprocesses do sometimes use their own subprocesses:
ek@Glub:~$ pstree -ainit(Ubuntu) ├─SessionLeader │ └─Relay(9) │ ├─bash │ │ └─git clone https://github.com/huggingface/transformers.git │ │ └─git remote-https origin https://github.com/huggingface/transformers.git │ │ └─git-remote-http origin https://github.com/huggingface/transformers.git...
In that example, thegit-remote-http
process may not receiveSIGKILL
. I am unsure how much this matters, but if it is a problem, then the more severe it is, theless severe this bug is, because the intended behavior wouldn't help anyway. Likewise, in situations where killing the parent process is sufficient, this bug also does not cause a problem.
That lower descendants are not killed has been reported as#895. That was observed in GitPython 2.0.2, which had the current approach ofkilling just the direct child processes.
A minor race condition…
One thing I'm a little worried about is a race condition that is currently present, and that I think may not be possible to fix,but that I worry finding child processes in a more portable way may exacerbate. Unless it can be solved or mitigated more deeply, it is a reason, unrelated to performance, to prefer that a portable substitute for the existing use ofps --ppid
not be too much slower than the current way. (I likewise worry that if the approach were changed to kill all descendants, then the added time to traverse the whole subtree might exacerbate this race condition.)
Suppose we plan to kill a process P and all its direct child processes including Q, and we find the PID of Q, but before killing Q, all the following happen:
- Q dies.
- Q is reaped. That is, it iswait(2)ed by its parent--which is either its original parent P or, if P has died, then
init
--causing its entry in the process table to be removed and its PID to be available for use by a future process. - A new process, R, is created and assigned Q's old PID.
Then when when we try to kill Q, we kill R.
This situation is rare, because in practice the time between when a process is reaped and when a new process is given its PID is only shortwhen the process table is nearly full so the kernel has no less recently relinquished PIDs to give out. But I think it would be best to avoid increasing the risk of it.
There may be other related race conditions, but this is the one that seems it could be worsened by replacing the existing unportable use ofps --ppid
with some other technique, if that other technique is markedly slower.
Finding/killing the the subprocesses portably
I am unsure if this should be done, because it is not clear to me that killing the parent process and its direct child processes, as is currently attempted (generally successfully on GNU/Linux and unsuccessfully elsewhere), is necessarily what should happen. Doing anything else might risk incompatibility for some existing use cases on some systems, so I would want to be cautious about doing something altogether different, but I think it should still be considered before proceeding.
However, assuming the current approach of killing the child processes should be preserved, I think there are three cases:
- Systems with
pgrep
/pkill
. - Systems whose
ps
is POSIX-compliant, or at least supports-A
and-o
. - Cygwin. (And the like, e.g., MSYS2. But
sys.platform == "cygwin"
still covers that.)
Case 1 could be folded into case 2 if a speed regression is acceptable (but see above on the race condition), or if testing reveals usingpgrep
orpkill
is not significantly faster. Case 3 could be dropped in lieu of modifying the docstring to document thatkill_after_timeout
is less effective on Cygwin, if reducing complexity is regarded as more important than covering it.
Whether to cover case 3 or not is more a matter of code complexity than time to write and review the code. With or without it, I think most of the time and effort would be on thetests. Currently none cover passingkill_after_timeout
toGit.execute
or to a dynamic method of aGit
object. Only theotherkill_after_timeout
—ofhandle_process_output
—has test coverage. Because this project has CI on Cygwin, I don't think the tests have to do much to accommodate it—its challenges are ready-made.
(An alternative to dealing with these details is to usepsutil, but I'm unsure if the impact of this issue is sufficient to justify adding it as a dependency. It doesn't support all systems, but systems it doesn't support are rare. I think it could be made conditional on the systems it is installable on, and the features that use it be documented as unavailable on other systems. I think this is probably not worth doing just for this, but if it turns out it would help in various other places, and increasing rather than decreasing OS compatibility—as it would here—then it might make sense to consider it. On the other hand, one benefit of GitPython is that it has very few dependencies.)
1. If we havepgrep
/pkill
pgrep
andpkill
are not POSIX, but they are available on many more systems thanps --ppid
. Furthermore, it is likely that all systems that supportps --ppid
also havepgrep
andpkill
, because not only are they very common, but procps (which provides the onlyps
with--ppid
I can find, as discussed above)includes an implementation of them. Of course, it's possible (odd, but possible) for a distribution to use procps forps
but not includepgrep
andpkill
. Whetherpkill
can be used to consolidate the steps, orpgrep
must be used together with something what is already there, is a design decision that should be influenced by a decision about the best order for sendingSIGKILL
.
If it's acceptable to sendSIGKILL
to the child processes first, then instead of running["ps", "--ppid", str(pid)]
and most of what comes after it, one option is to run["pkill", "-P", str(pid)]
and then:
- If it succeeds, immediately call
os.kill(pid, signal.SIGKILL)
andkill_check.set()
. - If it fails due to
pkill
not existing—or if it was checked first and found absent—proceed to case 2 (usingps
).
If it's not acceptable to sendSIGKILL
to the child processes first, or if either order is acceptable but it is desirable to share more code with the fallback case 2, then instead of running["ps", "--ppid", str(pid)]
, run["pgrep", "-P", str(pid)]
, then:
- If it failed due to
pgrep
not existing—or if it was checked first and found absent—proceed to case 2. - Treat an exit status of 0 or 1 as success; 1 is when there were no children. (This seems to hold across different implementations, but I'll want to look into it further, since these tools are not POSIX orXSI.)
- Parse the output—each line is just a PID, with no headers, no other columns—into
child_pids
. - Continue with the rest of the
kill_process
function as it already exists.
2. Ifps
supports-A
and-o
This is almost every Unix-like system used today;POSIX requires these options.
Instead of running["ps", "--ppid", str(pid)]
, run["ps", "-A", "-o", "pid,ppid"]
, then:
- Check that the first row is
PID
andPPID
to safeguard against unexpectedly nonstandardps
. - Each remaining row should have child and parent process IDs. Filter for the rows where the parent (second column) is what we passed, and populate
child_pids
with the PIDs from the first column. - Continue with the rest of the
kill_process
function as it already exists.
If using this is as fast apkill
/pgrep
, or slower but not by a lot, or code simplicity is considered more important than the small worsening of the rare race condition, then this could be used on all systems except Cygwin. The truth is that it is only out of fear of worsening things in weird situations on GNU/Linux systems with procps that I have even proposed case 1. This is the portable way to do it (except Cygwin).
It may be possible to optimize this with-U
to filter the real user ID toos.getuid()
, or-u
to filter the effective user ID toos.geteuid()
, though-u
seems to be an XSI extension. I don't know if this would actually make things faster. I don't know if the added complexity, though modest, would be worthwhile even if it does. When doing this,-A
would not also be passed.
The reason not to simply omit-A
without replacing it, which gets processes that share the caller's EUID, is that it also only shows processes with the same controlling terminal. The reason not to use-a
instead is that it doesn't show processes not associated with any terminal. The reason I prefer-A
to its synonym-e
is that-e
seems to be an XSI extension.
3. Cygwin
Runningps
on Cygwin gives output that looks like:
PID PPID PGID WINPID TTY UID STIME COMMAND 1641 1 1641 27868 ? 197609 Nov 14 /usr/bin/ssh-agent 2201 1 2201 27112 ? 197609 02:35:26 /usr/bin/mintty 2202 2201 2202 41336 pty0 197609 02:35:26 /usr/bin/bash 2304 2202 2304 47276 pty0 197609 14:47:12 /usr/bin/ps
This can be modified by various options, but-o
is not supported. (-A
is not supported either, but it is not needed.)
Instead of running["ps", "--ppid", str(pid)]
, we can run["ps"]
, then:
- Check the first row headers, or at least the leading
PID
andPPID
headers that we are going to use, to safeguard against unexpectedly non-Cygwinps
or future changes to Cygwinps
. - Continue as in case 2 after the check, making sure to use only the first two fields.
Perspective
I think thewhat is more important than thehow, because:
Test coverage
Unlike the otherkill_after_timeout
(inhandle_process_output
), the code path whereGit.execute
is passedkill_after_timeout
has no test coverage. It would be good to test it even if this bug is not fixed. But at the root of both is figuring out if killing the parent process and its direct child processes are what is wanted.
Maintainability
Thekill_process
callback is never called on (native) Windows, where callingGit.execute
with a non-None
value forkill_after_timeout
raisesGitCommandError
. But it contains what seem to be the remains of an attempt to support Windows: itpassesPROC_CREATIONFLAGS
(which is 0except on Windows) when runningps
, and itfalls back tosignal.SIGTERM
whensignal.SIGKILL
is absent (which it is on Windows).
I discovered this whole issue because I want to remove that code, which I think could lead to future bugs, and I was looking into whether there is any reason not to. A possible reason not to is ifkill_process
can be easily modified to support Windows—which it could, if it is acceptable to kill either only the parent process, or the whole process tree, though whether itshould is another question. Because figuring out what to do about this issue entails figuring that out too, it would openkill_process
up to that improvement—dropping its vestigial Windows code if it is not going to support Windows—and possibly others.