Reportedly, some VM jobs (and possibly others) get into a "stuck" state where they
make no progress: fraction done doesn't change, and CPU usage is minimal.
These jobs will eventually be aborted when their elapsed time reaches the rsc_fpops_bound limit,
but this could take weeks or months depending on the limit.
Proposal: have the client try to figure out when a job is stuck.

    ACTIVE_TASK new fields:
        double stuck_check_elapsed_time
        double stuck_check_fraction_done
        double stuck_check_cpu_time
        (initialize all to zero)

    STUCK_CHECK_POLL_PERIOD = 3600

    every STUCK_CHECK_POLL_PERIOD seconds
        for each active task atp
            if non_cpu_intensive: continue
            if sporadic: continue
            if atp->stuck_check_elapsed_time == 0
                atp->stuck_check_elapsed_time = atp->elapsed_time
                atp->stuck_check_fraction_done = atp->fraction_done
                atp->stuck_check_cpu_time = atp->current_cpu_time
                continue
            if atp->elapsed_time < atp->stuck_check_elapsed_time + STUCK_CHECK_POLL_PERIOD
                continue
            if atp->stuck_check_fraction_done == atp->fraction_done
                && (atp->current_cpu_time - atp->stuck_check_cpu_time < 10)
                (job is stuck - print warning)
            atp->stuck_check_elapsed_time = atp->elapsed_time
            atp->stuck_check_fraction_done = atp->fraction_done
            atp->stuck_check_cpu_time = atp->current_cpu_time

That is, a job is considered stuck if, over the last hour of running, the fraction done hasn't changed
and the incremental CPU time is less than 10 seconds.
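
For illustration, the check could be sketched in C++ roughly as follows. This is a minimal, self-contained sketch and not the client's actual code: the `ActiveTask` struct is a hypothetical stand-in for `ACTIVE_TASK` holding only the fields the check needs, and `stuck_check()`, `take_snapshot()`, `check_all()`, and the constant name `STUCK_CHECK_MIN_CPU_TIME` (for the 10-second threshold) are names assumed here.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical stand-in for ACTIVE_TASK; only the fields the check needs.
struct ActiveTask {
    bool non_cpu_intensive = false;
    bool sporadic = false;
    double elapsed_time = 0;      // run time so far, in seconds
    double fraction_done = 0;     // progress reported by the app, 0..1
    double current_cpu_time = 0;  // CPU time used so far, in seconds
    // Snapshot from the previous check; zero means "no snapshot yet".
    double stuck_check_elapsed_time = 0;
    double stuck_check_fraction_done = 0;
    double stuck_check_cpu_time = 0;
};

constexpr double STUCK_CHECK_POLL_PERIOD = 3600;  // seconds
constexpr double STUCK_CHECK_MIN_CPU_TIME = 10;   // assumed name for the 10s threshold

// Record the task's current state as the baseline for the next check.
static void take_snapshot(ActiveTask& t) {
    t.stuck_check_elapsed_time = t.elapsed_time;
    t.stuck_check_fraction_done = t.fraction_done;
    t.stuck_check_cpu_time = t.current_cpu_time;
}

// Returns true if the task appears stuck: at least one poll period of
// elapsed time has passed since the snapshot, fraction done is unchanged,
// and less than STUCK_CHECK_MIN_CPU_TIME seconds of CPU time were used.
bool stuck_check(ActiveTask& t) {
    if (t.non_cpu_intensive || t.sporadic) return false;
    if (t.stuck_check_elapsed_time == 0) {
        take_snapshot(t);  // first check: just record a baseline
        return false;
    }
    if (t.elapsed_time < t.stuck_check_elapsed_time + STUCK_CHECK_POLL_PERIOD) {
        return false;      // not enough run time since the snapshot
    }
    bool stuck = t.fraction_done == t.stuck_check_fraction_done
        && (t.current_cpu_time - t.stuck_check_cpu_time) < STUCK_CHECK_MIN_CPU_TIME;
    take_snapshot(t);
    return stuck;
}

// Intended to be called every STUCK_CHECK_POLL_PERIOD seconds.
// Prints a warning per stuck task and returns how many were found.
int check_all(std::vector<ActiveTask>& tasks) {
    int n_stuck = 0;
    for (auto& t : tasks) {
        if (stuck_check(t)) {
            std::fprintf(stderr, "warning: task appears stuck; consider aborting it\n");
            n_stuck++;
        }
    }
    return n_stuck;
}
```

Because the comparison uses the task's elapsed (run) time rather than wall-clock time, a task that was suspended or preempted during the period is not flagged until it has accumulated a full period of run time since the last snapshot.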
At that point, the client could
1) notify the user, suggesting that they abort the job
2) abort the job
Let's do 1) for starters, to make sure that the logic is right,
then at some point do 2).