ExitChecker: add lnx/bsd single-pid exit watchers #4600
Conversation
a9fa248 to 5b13bdd Compare

Could you let me know when this draft PR stabilizes and gets re-based? Thanks. No rush; I have at least two reported SN defects which will take a significant amount of time to sort out.

I gave this PR an initial read or three. I'd like to be able to study a 'near final' version and not have it change under me. Knowing that such a pause will exist will allow me to focus on exploring some reports.
2579787 to e31af30 Compare
12b396f to d604f60 Compare

@LeeTibbert as requested, this one is ready to review. cc @WojciechMazur Please review just the last commit, as the PR contains #4622 and #4623 and is to be merged after them.
unit-tests/shared/src/test/scala/org/scalanative/testsuite/javalib/lang/ProcessTest.scala
Thank you for the heads up and pause. I am beginning to review this PR for concept; I'll focus on the overall design. Delay may work in my/our favor: the required prerequisite PR may have been merged by then. I suspect that this is a major architectural change and deserves close review. No disrespect.
```scala
val factoryOpt: Option[Factory] =
  if (LinktimeInfo.isWindows) None
  else if (LinktimeInfo.is32BitPlatform) None // TODO: wonder why
```
LeeTibbert Nov 3, 2025 • edited
Good question in the TODO.

I have not traced far enough to determine what None
does. I'll keep my eyes open for what the caller does.
I think a return value of None does not mean the obvious
"no watcher needed" but rather "use the traditional, since-replaced, C++ watcher".
I believe that was the intent of the, say, 0.5.9 code. I hope
the code actually implemented that intent.
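To make that reading concrete, here is a hedged sketch of what a caller honoring that intent might look like; ExitChecker, Factory.create, and TraditionalWatcher are illustrative names, not code from this PR:

```scala
// Hypothetical caller: None selects the pre-existing watcher path,
// not "no watcher at all". All names here are illustrative.
val checker: ExitChecker = factoryOpt match {
  case Some(factory) => factory.create(pid) // pidfd/kqueue-backed checker
  case None          => TraditionalWatcher.checkerFor(pid) // legacy C++-era path
}
```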
I never took the time to determine if 32 bit Linux had pidfd.
I do know that I did not have access to a machine where
I could test it. No X32 cases in CI in those days.
So, I chickened out and followed a conservative path,
keeping the traditional watcher.

Only you know if making changes for 32-bit systems is worth your time. My suggestion is that it is not economic,
at least until there is some known defect specific to those machines.
I checked the multiarch .yml, and the 64- and 32-bit tests
use Java 17. To check the bookkeeping: that means
that ProcessTestOnJDK9 should be running.
When I tried the experiment of doing away with a Watcher
altogether, those were the tests which showed that
some kind of Watcher is needed.
The fact that X86 tests are passing ProcessTestOnJDK9 leads, weakly, to the conclusion
that some kind of Watcher must be running there.
```scala
private val hasProcessesToWatch = lock.newCondition()

def watchForTermination(handle: GenericProcessHandle): Unit = {
  processes.put(handle.pid(), handle)
```
Not touched by this PR, but worth mentioning
in case it is touched in the future:
putting the processes.put() after the isAlive() check would
save effort in the hopefully rare case that the watcherThread is not alive.
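A minimal sketch of that reordering, assuming the surrounding code looks roughly like the fragment above; watcherThread and ensureWatcherRunning are placeholders for whatever the real implementation does:

```scala
def watchForTermination(handle: GenericProcessHandle): Unit = {
  lock.lock()
  try {
    // Handle the hopefully rare "watcher not alive" case first, so the
    // map insertion is not wasted effort on that path.
    if (!watcherThread.isAlive()) ensureWatcherRunning()
    processes.put(handle.pid(), handle)
    hasProcessesToWatch.signal()
  } finally lock.unlock()
}
```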
```scala
  if (pidfd < 0) None else Some(pidfd)
}

/* We use ppoll here as an epoll descriptor may not be persisted; if multiple
```
LeeTibbert Nov 4, 2025 • edited
When dealing with one pidfd, ppoll() is much lighter weight and almost certainly takes fewer CPU cycles. I say
'almost certainly' because I do not have numbers and
I have not traced the execution path to see if ppoll is implemented in terms of epoll.

The ppoll() call does not change the signal mask. Experience
to date has shown changing it to be unnecessary.
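For readers unfamiliar with the mechanism: a pidfd simply becomes readable once the child exits, so a single-pid check needs nothing heavier than poll/ppoll. A minimal sketch, assuming Scala Native's POSIX poll bindings; the helper name is hypothetical:

```scala
import scala.scalanative.unsafe._
import scala.scalanative.unsigned._
import scala.scalanative.posix.poll._
import scala.scalanative.posix.pollOps._
import scala.scalanative.posix.pollEvents._

// Non-blocking check: has the child behind `pidfd` exited?
def pidfdHasExited(pidfd: CInt): Boolean = {
  val fds = stackalloc[struct_pollfd]()
  fds.fd = pidfd
  fds.events = POLLIN // a pidfd signals POLLIN on process exit
  fds.revents = 0
  // timeout 0 => poll returns immediately; > 0 means the pidfd is ready
  poll(fds, 1.toULong, 0) > 0 && (fds.revents & POLLIN) != 0
}
```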
I believe the rationale described in the rest of the comment
may well be misdirection.

Earlier this week a contributor explained to me
the path from the pidfd through to the Linux internal
reference. All my reading about epoll() indicates
that multiple calls to epoll on the same pidfd will
be notified when the pidfd state changes; that is,
all epoll waiters on the same internal slot.
Now, I have not run a program to test this out and
I have not traced Linux internals.

You probably would not have entered the comment
if you did not have a good reason.
Describing the lack of epoll() answers the question that
a reader might have before they formulate it. That may
be sufficient.

The epoll description may be somewhere between too much information (the dev of 10,000 comments speaking)
and information pending validation.
I don't know how epoll behaves internally; I can go by two points:
- one is what ChatGPT keeps telling me; I find that if the LLM persists despite constant challenges and disbelief, it tends to be right
  - for instance, it kept insisting that some code in macOS's implementation of kevent makes NOTE_EXITSTATUS completely unreliable, and I had to try it myself before running into problems and agreeing
  - in this case, it is firm that both epoll_wait and kevent are not guaranteed to be woken up by the kernel if there are other waiters for the same event
- two is this: this is not the last commit in my sequence, and there are a few more leading to an exit watcher which persists kqueue (bsd) and epoll (linux) descriptors (and also allows watching for multiple processes if desired)
  - when I switched the ProcessHandle.waitFor implementation to use this persisted-state exit watcher, CI tests started failing immediately, hanging for 45 minutes and then getting killed
  - in the case of BSD, the only difference was the persisted kqueue
  - in the case of Linux, it was epoll_create+epoll_ctl (persisted) and epoll_wait, instead of ppoll
  - if the LLM is to be believed, GenericProcessWatcher and a unit test calling waitFor would clash and one of them would block (it doesn't even matter which one)

So, while I can't state unequivocally that persisting epoll is definitely a problem, there's a non-zero probability that using it in this particular manner might be.
LeeTibbert commented Nov 4, 2025 • edited
I have spent at least two long, rested sessions examining this PR and have reached some conclusions. I believe that I am looking at the most current code; hard to tell for sure.

Let me speak plainly, with the intent of helping the dialectic, not offending, and pick my words carefully.

At the line-of-code level I saw an attention to detail that made it believably correct. I believe that it is necessary to rise above the line-by-line and ask two questions.

I do agree that the pid waiting & reaping code as of Nov 1 would benefit from improvement.
In addition to the line comment below, here is my take on the other comment, about design.

Perhaps it would help if I renamed this EventWatcher as something else; maybe that's the confusion. All it does is allow querying whether some process (or the process, for single-pid watchers) has exited. Then the process handle uses it to implement its waitFor methods. That's all.

The old bsdWaitForImpl has become EventWatcherFreeBSD.Single.waitAndReapSome, and the old linuxWaitForImpl is now EventWatcherLinux.Single.waitAndReapSome.
here are the next steps, if this ever gets merged:
- unify Gen1 and Gen2:
  - implement an any-process EventWatcher using waitpid(-1), for Gen1 (see the sketch after this list)
  - move Gen2 construction complexity into a factory
  - now that we have a factory to create UnixProcess and an EventWatcher to monitor its completion, unify Gen1 and Gen2
- watch multiple processes at once:
  - implement a multi-process EventWatcher (persisted kqueue or epoll set with multiple pids), for Gen2
  - since we think that a persisted kqueue/epoll cannot be used with multiple waiters, use it exclusively in GenericProcessWatcher (where we are guaranteed to call one at a time)
  - possibly better than ppolling/keventing one process at a time
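A hedged sketch of the waitpid(-1) idea from the first bullet, using Scala Native's POSIX wait bindings; the object and method names are placeholders, not code from this PR:

```scala
import scala.scalanative.unsafe._
import scala.scalanative.posix.sys.wait._

object AnyProcessExitChecker {
  // Blocks until any child exits; returns (pid, raw wait status),
  // or None if there are no children left (or on error).
  def waitAndReapAny(): Option[(Int, Int)] = {
    val status = stackalloc[CInt]()
    val pid = waitpid(-1, status, 0) // -1: wait for any child process
    if (pid > 0) Some((pid, !status)) else None
  }
}
```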
Expand `pidfd` field in UnixProcessGen2 to take a factory object which initializes an instance which can be used to query this process for completion.
Renamed as ExitChecker, to remove confusion with GenericProcessWatcher: the watcher has an active thread, while this new checker doesn't do anything until asked, so it provides the ability to check for exit.
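The passive/active split described here could be captured in an interface along these lines; a hedged sketch with assumed method names, not the PR's actual trait:

```scala
trait ExitChecker {
  // Passive, non-blocking: Some(exitCode) once the process has exited,
  // None while it is still running. Nothing happens until this is called.
  def checkIfExited(): Option[Int]

  // Block up to `timeoutMillis` for the process to exit; true on exit.
  def waitForExit(timeoutMillis: Long): Boolean
}
```

By contrast, a watcher in this terminology owns a thread that reacts to exits on its own.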
It is nice to see the sketch of a development plan. I think this dialogue is moving forward. To try to have some kind of order, let me split my comments into present and future.

Present. If you say all of this is going to change, and soon, perhaps you might think of leaving this as it is for now. Yes, there is an overhead to holding lots of 'pending' files in the current Work-In-Progress.
I must be in the minority who thinks that the proposed state is a bit clearer, with a few design patterns deployed to add a clear division of labor, as well as to remove several if-else and throw platform-dependent statements.
LeeTibbert commented Nov 5, 2025 • edited
> for instance, it kept insisting that some code in MacOS's implementation of kevent makes NOTE_EXITSTATUS completely unreliable

I've encountered this statement before. It sounds like a macOS 10.n issue, where n is something like 5. Agreed, we need to be on the lookout and probably exercise these paths.

On a closely related topic: more and more in my reading about pidfd I see questions about what happens to waiters when the descriptor is closed. If you are going where I think you are going, a single OS waiter, this may not be a concern. I think eventfd was specifically designed to wake up all waiters, so the close there may not be a problem. I mention the possible 'pidfd' behavior because I had earlier described allocating one and sharing it.

Somewhere down my interrupt stack, and getting further down by the day, is another task.
LeeTibbert commented Nov 5, 2025 • edited
Future

Code simplification

Unifying Gen1 and Gen2 to the greatest extent sensible is certainly a good goal.

Process waiting architecture

I suggest this is the crux. I also think that the JVM specification does us no favors here. Let me describe two contrasting models and apply some math. Yes, I am skipping over many details.

For analysis, let's identify two models, or design architectures, for how process reaping happens. One model can be described as "Watcher does most of the work", say "Load the Watcher". Another model can be described as "Processes do the majority of the waiting and reaping", say "Load the Processes".

I'll assume an event-driven Watcher, with the sequential scan of Processes being watched removed. I think this is the key area needing both design studies and analysis. If the parent is not in a waitFor, the Watcher reaps the process. If the process is in some form of waitFor, there is a race to reap the child. I think the advantage in reduced code complexity goes to the "Watcher does most of the work" model. I will also note that probably 95%+ of current and near-term uses will be of this simple shape.

One Watcher, one Process, two waitFor

Now expand into the case where there is one Watcher, one Process, its child, and two threads in waitFor. Now there are two chances that the slightly more runtime-efficient process reaping will happen, and one chance that the Watcher will have to spend time reaping the child.

One Watcher, two Processes, each with a waitFor

For the sake of analysis, consider the case where the child process for each of the Processes exits at about the same time. Now there are two chances that the slightly more runtime-efficient process reaping will happen, and one chance that the Watcher will have to spend time reaping the children. If each Process itself has several threads in waitFor, the chances for a low-latency reap improve. Theoretical advantage to the "Load the Processes" model.

One Watcher, large N Processes, each with possibly more than one pending waitFor

Consider an extreme, or upper bound, Pandemonium case. Will SN ever actually encounter this? Now there are large-N+ chances that the slightly more runtime-efficient process reaping will happen, and one chance that the Watcher will have to spend time reaping the children. By far and away, the advantage goes to the "Load the Processes" model.

Alternatives

A trial design study implementation may be able to remove the "Watcher reaps" path and its attendant complexity.

Measuring theory

Theory allows one to make informed guesses, but how do we measure the slow points of an actual implementation?

Summation

In the "single Watcher, single Process, single waitFor() or not" case, the two models are close. Amortized over current and foreseeable workloads, the "Load the Processes" model is highly favored. The test of skill in either model, and probably in any related model, is how the irreducible races are handled. Granted that having a black belt is the price of entry, does one need a first-degree black belt?
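The race both models must handle, whoever wins it, can be reduced to a single atomic claim; a hedged illustration (names invented here) in which exactly one of the waitFor thread or the Watcher ends up calling waitpid on the child:

```scala
import java.util.concurrent.atomic.AtomicBoolean

final class ReapToken {
  private val claimed = new AtomicBoolean(false)

  // Exactly one caller -- a thread in waitFor or the Watcher -- wins the
  // claim and must reap the child; everyone else just awaits the result.
  def tryClaim(): Boolean = claimed.compareAndSet(false, true)
}
```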
> must be in the minority

As far as I can tell, there are only two voices so far, so equality, not minority. I do not have either the desire or the standing to be Horatius Cocles, who held the bridge alone. I do not doubt for a moment that there are good points here which should be captured.
It is off-topic for this PR, but it does pertain to the 'future work' sketch presented earlier.

I propose to do a private design study using the current (Nov 5) mainline code. If that study fails CI or does not show a reduction in the number of Processes the Watcher must scan, it dies quietly. If it shows what I think it will, I could submit a Draft PR for discussion.

I want to stay out of your way whilst you are on such a productive roll.

Reducing the load on a Watcher using a sequential scan is just shy of essential. I believe that reducing the load on any eventual event-driven Watcher will also help. If the proposed change does work, it would be relatively easy to implement, I think. I think the above may offer the greatest benefit for time to market.

An alternative would be for me to do an independent design study using current mainline code. My intent would be that such a study never itself become a candidate for merge; there is too much in flight.

Of course, I will be reading along as you post your efforts.
Expand `pidfd` field in UnixProcessGen2 to take an event watcher factory object which initializes an event watcher which can be used to monitor this process for completion.