Make batch.id robust to warning messages from sbatch #314


Open
bwcompton wants to merge 3 commits into mlr-org:main from bwcompton:bwcompton-robust-sbatch

Conversation

@bwcompton

I ran into a crazy bug today: getJobStatus gave me batch.id = "that". It turns out that when I requested a large amount of memory, sbatch returned this, um, helpful message:

sbatch: INFO: Note that 128 GB per node will require a node with more than 128 GB memory because of overhead. Check https://docs.unity.rc.umass.edu/nodes for an appropriate limit.
Submitted batch job 38139957

clusterFunctionsSlurm was pulling the 4th word of the first line, which should have been the Slurm jobid, but instead was "that". It wanted, of course, the last line.

This really isn't a bug in batchtools, as the sysops inserted an informational message in a crazy place. But I suspect if the smart, on-the-ball people at the UMass Unity cluster are doing this, others probably are too. It'd be nice for batchtools to be robust to such shenanigans. Alternatively, I suppose it could throw an error if batch.id is non-numeric and print the message from sbatch.
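The alternative mentioned above (fail loudly when batch.id is not numeric, and show what sbatch actually said) can be sketched in shell. This is only an illustration of the idea, not batchtools code: `is_numeric_job_id` and `check_job_id` are hypothetical helper names, and the sketch assumes a plain integer job ID.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the candidate job ID is all digits.
is_numeric_job_id() {
  case "$1" in
    '' | *[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Hypothetical helper: print the ID if it looks valid; otherwise report the
# raw sbatch output on stderr and fail.
check_job_id() {
  candidate_id=$1
  raw_output=$2
  if is_numeric_job_id "$candidate_id"; then
    printf '%s\n' "$candidate_id"
  else
    printf 'Unexpected batch.id "%s"; sbatch said:\n%s\n' "$candidate_id" "$raw_output" >&2
    return 1
  fi
}

# Example: the bogus ID from this report is rejected instead of being stored.
check_job_id that 'sbatch: INFO: Note that 128 GB per node ...' || echo "submission rejected"
```

Failing at submission time keeps a bogus ID such as "that" from surfacing later in getJobStatus.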

My suggested change looks for a line beginning with "Submitted batch job" and pulls the 4th word as the batch.id.

I've tested this change against the following:

output <- 'Submitted batch job 12345678'
output <- 'This is a crazy informational message\nSubmitted batch job 98765432'
output <- 'This is crazy\nand uncalled for\nSubmitted batch job 5555555\nand even more stuff'

as well as against real-life submitJobs calls, both with and without the informational message.
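The rule described in this comment (find the line beginning with "Submitted batch job" and take its 4th word) can be sketched in shell with awk. This is not the R change in the pull request, just the same parsing rule in a form that is easy to try at a prompt; `extract_job_id` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: read sbatch output on stdin and print the 4th word of
# the first line that begins with "Submitted batch job". Exits non-zero if no
# such line exists.
extract_job_id() {
  awk '
    /^Submitted batch job / { print $4; found = 1; exit }
    END { if (!found) exit 1 }
  '
}

# Example: an informational line ahead of the submission line is skipped.
printf 'sbatch: INFO: an informational note\nSubmitted batch job 38139957\n' | extract_job_id
```

Matching on the line prefix, rather than on line position, is what lets the three test strings above all parse correctly.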

@HenrikBengtsson

HenrikBengtsson commented Sep 5, 2025 (edited)

You might want to create an issue for this that references this pull request. At least I tend to miss or forget about PR-only issues over time, and I know other repos prefer an issue with details where discussions can take place.

Now, I had a look at runOSCommand(), which is what captures the output, via

res = suppressWarnings(system2(command = sys.cmd, args = sys.args, stdin = stdin, stdout = TRUE, stderr = TRUE, wait = TRUE))

That captures both stdout and stderr. It might be saner to capture the two separately, e.g. something like stdout = TRUE and stderr = "error.log", where the expected output should go to stdout and info messages to stderr. To test whether that would have helped you, if you run

$ sbatch --time=00:01:00 --mem=128G --wrap="hostname" > stdout.log 2> stderr.log

what does

$ cat stdout.log
$ cat stderr.log

output? With Slurm, you should see "Submitted batch job ..." in stdout.log. Now, my hope is that "sbatch: INFO: Note that 128 GB per node will require a node with more than 128 GB memory because of overhead. Check https://docs.unity.rc.umass.edu/nodes for an appropriate limit." ends up in stderr.log for you.
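To see why separate capture would help, here is a small shell sketch that needs no Slurm installation. `fake_sbatch` is a hypothetical stand-in that assumes the site's note goes to stderr and the submission line to stdout, which is exactly the hypothesis being tested above; `fourth_word_of_first_line` mimics the "4th word of the first line" parsing described earlier in this thread.

```shell
#!/bin/sh
# Hypothetical stand-in for sbatch: informational note on stderr,
# submission line on stdout.
fake_sbatch() {
  echo 'sbatch: INFO: Note that 128 GB per node will require a node with more than 128 GB memory because of overhead.' >&2
  echo 'Submitted batch job 38139957'
}

# Print the 4th whitespace-separated word of the first line on stdin.
fourth_word_of_first_line() {
  awk 'NR == 1 { print $4 }'
}

# Streams merged into one, comparable to stdout = TRUE and stderr = TRUE:
# the note arrives first, so its 4th word is picked up.
merged=$(fake_sbatch 2>&1 | fourth_word_of_first_line)

# stderr kept away from the parser: only the submission line is seen.
separate=$(fake_sbatch 2>/dev/null | fourth_word_of_first_line)

echo "merged:   $merged"
echo "separate: $separate"
```

With the merged stream the parser sees the note first and yields "that"; with stderr diverted it yields the job ID.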

@bwcompton
Author

Nice!

bcompton_umass_edu@login1:~$ sbatch --time=00:01:00 --mem=128G --wrap="hostname" > stdout.log 2> stderr.log
bcompton_umass_edu@login1:~$ cat stdout.log
Submitted batch job 42933105
bcompton_umass_edu@login1:~$ cat stderr.log
sbatch: INFO: Note that 128 GB per node will require a node with more than 128 GB memory because of overhead. Check https://docs.unity.rc.umass.edu/nodes for an appropriate limit.
bcompton_umass_edu@login1:~$

It looks like you can do a cleaner fix than what I came up with.

HenrikBengtsson reacted with thumbs up emoji

@HenrikBengtsson

I've been prototyping with a more flexible runOSCommand() in my future.batchtools package. It has new arguments stdout and stderr, defaulting to stdout = TRUE and stderr = TRUE (backward compatible). The special value stderr = NA will capture stderr separately from stdout.

@bwcompton, although it's future.batchtools and not batchtools, could you please give it a spin? If it works, then I can propose this newer runOSCommand() version to batchtools, plus adjustments to makeClusterFunctionSlurm(), which I also patch in future.batchtools.

To try it out, install it as:

remotes::install_github("futureverse/future.batchtools", ref = "develop")

and then try it as:

library(future)
plan(future.batchtools::batchtools_slurm)
f <- future({
  Sys.info()[["nodename"]]
})
v <- value(f)
print(v)

See https://future.batchtools.futureverse.org/reference/batchtools_slurm.html for how to control sbatch resource specifications.

@bwcompton
Author

bwcompton commented Sep 12, 2025 via email

Thanks! I tried your code snippet, and it can't find slurm_script. Am I missing something?

Brad

> library(future)
> plan(future.batchtools::batchtools_slurm)
> f <- future({ Sys.info()[["nodename"]] })
> v <- value(f)
Error: Future (<unnamed-1>) of class BatchtoolsSlurmFuture expired, which indicates that it crashed or was killed.
Post-mortem details:
Future state: ‘running’
Batchtools status: ‘defined’, ‘expired’, ‘submitted’
Slurm job ID: [n=1] ‘43049392’
Slurm 'squeue' job status: <empty>
Slurm 'sacct' job status: 43049392|FAILED|1:0
The last few lines of the logged output:
Session information:
- timestamp: 2025-09-12 14:36:54+0000
- hostname: cpu016
- Rscript path: /var/spool/slurm/slurmd/job43049392/slurm_script: line 20: Rscript: command not found
- Rscript version: /var/spool/slurm/slurmd/job43049392/slurm_script: line 21: Rscript: command not found
- Rscript library paths: Rscript -e 'batchtools::doJobCollection()' ...
- job name: 'jobb9686511f15322fe9d3568b52c61e703'
- job log file: '/work/pi_cschweik_umass_edu/marsh_mapping/salt-marsh-mapping/.future/20250912_143653-MdNjCh/batchtools_1109039380/logs/jobb9686511f15322fe9d3568b52c61e703.log'
- job uri: '/work/pi_cschwe
In addition: Warning messages:
1: batchtools::waitForJobs(..., timeout = 2592000) returned FALSE
2: In delete.BatchtoolsFuture(future) : Will not remove batchtools registry, because the status of the batchtools was ‘error’, ‘defined’, ‘expired’, ‘submitted’ and future backend argument 'delete' is ‘on-success’: ‘/work/pi_cschweik_umass_edu/marsh_mapping/salt-marsh-mapping/.future/20250912_143653-MdNjCh/batchtools_1109039380’
>

@HenrikBengtsson

> Rscript: command not found

R is not available by default in your jobs. Do you load an environment module to get access to R? If so, specify that in the resources argument, e.g.

plan(future.batchtools::batchtools_slurm, resources = list(modules = "r"))

This is also illustrated in https://future.batchtools.futureverse.org/reference/batchtools_slurm.html

If you use other techniques to make R available in a job script, please let me know.
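A quick way to confirm this diagnosis is to check, from inside a job script, whether Rscript is on the PATH before anything tries to call it. The sketch below is a generic shell check, not part of the future.batchtools template; `have_command` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the named command can be found on PATH.
have_command() {
  command -v "$1" >/dev/null 2>&1
}

# Report whether Rscript is visible to this shell. Inside a job script, the
# second branch corresponds to the "Rscript: command not found" error above.
if have_command Rscript; then
  echo "Rscript found at: $(command -v Rscript)"
else
  echo "Rscript not on PATH; an environment module may need to be loaded first"
fi
```

Running the same check on the login node and inside a job shows whether the two environments differ.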

@HenrikBengtsson

That said, the job submission itself actually worked! It's just that R didn't start, which means the patch works.

@bwcompton
Author

Great news that the patch works.

Here's what I've got in my template, slurm.tmpl. I'm not sure how to squeeze this into the resources option. This is something I got help with from a sysadmin. It works great with batchtools.

## Call batchtools inside container
module load apptainer/latest
export APPTAINER_BINDPATH="/run/munge,/var/run/munge,/etc/slurm,/var/spool/slurm/slurmd/conf-cache/slurm.conf,$APPTAINER_BINDPATH"
apptainer exec /modules/admin-resources/ood-dev/unity-r_4.4.0.sif Rscript --no-restore --quiet --no-save -e 'batchtools::doJobCollection("<%= uri %>")'

HenrikBengtsson reacted with thumbs up emoji

@HenrikBengtsson

> I'm not sure how to squeeze this into the resources option

Unfortunately, that's not possible today; you'd have to create your own custom template file. But I've created futureverse/future.batchtools#99 to add support for this too. Stay tuned.

@bwcompton
Author

Okay, I'll look forward to future.batchtools in the future.

Do you have what you need from me to address the original issue in this PR?

@HenrikBengtsson

> Do you have what you need from me to address the original issue in this PR?

Yes, I'd like to have a success story over at future.batchtools first, ideally some mileage from other users, and have my patch "ripe" enough, before I "bug" the batchtools maintainers here. So, I'll ping you again over at futureverse/future.batchtools#99 for you to test. Thanks.

@bwcompton
Author

Deal! Thanks so much for your help with this.
