Every non-trivial task performed in Toolforge (like executing a script or running a bot) should be dispatched to a job scheduling backend (in this case, Kubernetes), which ensures that the job is run in a suitable place with sufficient resources.
The basic principle of running jobs is fairly straightforward:
Jobs can be scheduled synchronously or asynchronously, continuously, or simply executed once.
Jobs should be run from a Tool Account.
Information about job creation using the toolforge jobs run command.
One-off jobs (or normal jobs) are workloads that will be scheduled by Toolforge Kubernetes and run until finished. They will run once, and are expected to finish at some point.
Select a runtime image, a command in your tool home directory, and then use toolforge jobs run to create the job. Example using job name myjob:
tools.mytool@tools-sgebastion-11:~$ toolforge jobs run myjob --image somelang1.23 --command ./mycommand.sh
The --command option supports input arguments, using quotes. Example:
tools.mytool@tools-sgebastion-11:~$ toolforge jobs run myjob --image somelang1.23 --command "./mycommand.sh --witharguments"
You can instruct the command line to wait and not return until the job is finished with the --wait option. By default the timeout is 10 minutes, but a custom number of seconds can be specified instead:
tools.mytool@tools-sgebastion-11:~$ toolforge jobs run myjob --image somelang1.23 --command ./mycommand.sh --wait
tools.mytool@tools-sgebastion-11:~$ toolforge jobs run nothing --image somelang1.23 --command "sleep 600" --wait 630
To schedule a recurring job (also known as a cron job), use the --schedule WHEN option when creating it:
tools.mytool@tools-sgebastion-11:~$ toolforge jobs run mycronjob --command ./daily.sh --image somelang1.23 --schedule "@daily" --timeout 3600
The schedule argument uses cron syntax.
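As a refresher on that syntax (this example is illustrative and not from the framework documentation; the job name weekly-report and its script are placeholders), a cron schedule has five fields:

```shell
# Field order: minute hour day-of-month month day-of-week
# "27 3 * * 1" means 03:27 every Monday. Passed to the framework like:
toolforge jobs run weekly-report --image somelang1.23 \
    --command ./weekly-report.sh --schedule "27 3 * * 1"
```

This is a command fragment for the Toolforge environment; prefer the macros described below when an exact time of day is not required.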
Please use the @hourly, @daily, @weekly, @monthly and @yearly macros if possible. These allow the system to spread the cluster load evenly across the time period, which makes maintaining the cluster much easier.
@daily doesn't mean once a day at midnight, since the actual value is internally randomized. Please check Last schedule time via toolforge jobs list or toolforge jobs show mycronjob. Alternatively, kubectl get cronjobs will show the cron expression under SCHEDULE. You can force a rerun of a scheduled job with:
tools.mytool@tools-sgebastion-11:~$ toolforge jobs restart mycronjob
Continuous jobs are programs that are never meant to end. If they end (for example, because of an error) the Toolforge Kubernetes system will restart them.
To create a continuous job, use the --continuous option:
tools.mytool@tools-sgebastion-11:~$ toolforge jobs run myalwaysrunningjob --command ./myendlesscommand.sh --image somelang1.23 --continuous
In all job types (one-off, continuous, cron job) the --command parameter should meet the following conditions:
* Use a path to the command rather than a bare name: --command mycommand.sh will likely fail (it relies on $PATH), and --command ./mycommand.sh is likely what you mean.
* Quote the command if it includes arguments: --command "./mycommand.sh --arg1 x --arg2 y".
Failing to meet any of these conditions will lead to errors either before launching the job, or shortly after the job is processed by the backend.
The job name is a unique string identifier. The string should meet these criteria:
* It should contain only lowercase alphanumeric characters plus the . (dot) and - (dash) characters.
Failing to meet any of these conditions will lead to errors either before launching the job, or shortly after the job is processed by the backend.
In Toolforge Kubernetes you can use any image you built with the build service (preferred) or you can use one of the pre-defined container images.
To view which execution runtimes are available, run the toolforge jobs images command (note that if you are using the build service, you'll have to have built your image already for it to show up).
Example:
tools.mytool@tools-sgebastion-11:~$ toolforge jobs images
Short name   Container image URL
------------ ----------------------------------------------------------------------
bookworm     docker-registry.tools.wmflabs.org/toolforge-bookworm-sssd:latest
bullseye     docker-registry.tools.wmflabs.org/toolforge-bullseye-sssd:latest
jdk17        docker-registry.tools.wmflabs.org/toolforge-jdk17-sssd-base:latest
jdk21        docker-registry.svc.toolforge.org/toolforge-jdk21-sssd-base:latest
mariadb      docker-registry.tools.wmflabs.org/toolforge-mariadb-sssd-base:latest
mono6.8      docker-registry.tools.wmflabs.org/toolforge-mono68-sssd-base:latest
mono6.12     docker-registry.tools.wmflabs.org/toolforge-mono612-sssd-base:latest
node16       docker-registry.tools.wmflabs.org/toolforge-node16-sssd-base:latest
node18       docker-registry.tools.wmflabs.org/toolforge-node18-sssd-base:latest
node20       docker-registry.svc.toolforge.org/toolforge-node20-sssd-base:latest
perl5.32     docker-registry.tools.wmflabs.org/toolforge-perl532-sssd-base:latest
perl5.36     docker-registry.tools.wmflabs.org/toolforge-perl536-sssd-base:latest
perl5.40     docker-registry.svc.toolforge.org/toolforge-perl540-sssd-base:latest
php7.4       docker-registry.tools.wmflabs.org/toolforge-php74-sssd-base:latest
php8.2       docker-registry.tools.wmflabs.org/toolforge-php82-sssd-base:latest
php8.4       docker-registry.svc.toolforge.org/toolforge-php84-sssd-base:latest
python3.9    docker-registry.tools.wmflabs.org/toolforge-python39-sssd-base:latest
python3.11   docker-registry.tools.wmflabs.org/toolforge-python311-sssd-base:latest
python3.13   docker-registry.svc.toolforge.org/toolforge-python313-sssd-base:latest
ruby2.1      docker-registry.tools.wmflabs.org/toolforge-ruby21-sssd-base:latest
ruby2.7      docker-registry.tools.wmflabs.org/toolforge-ruby27-sssd-base:latest
ruby3.1      docker-registry.tools.wmflabs.org/toolforge-ruby31-sssd-base:latest
ruby3.3      docker-registry.svc.toolforge.org/toolforge-ruby33-sssd-base:latest
tcl8.6       docker-registry.tools.wmflabs.org/toolforge-tcl86-sssd-base:latest
trixie       docker-registry.svc.toolforge.org/toolforge-trixie-sssd:latest
In addition, several deprecated images remain available for older tools that rely on them, but they should not be used for new use cases.
NOTE: if your tool uses Python, you may want to use a virtualenv, see Help:Toolforge/Python#Jobs.
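For illustration, a minimal sketch of the virtualenv pattern (the demo-venv path and wrapper layout are assumptions for this sketch, not Toolforge requirements; see the linked help page for the authoritative setup):

```shell
# One-time setup: create a virtualenv. --without-pip keeps this sketch
# dependency-free; on Toolforge you would keep pip so you can install
# your bot's dependencies into the venv.
VENV_DIR="${TMPDIR:-/tmp}/demo-venv"   # on Toolforge, use a path in your tool home
python3 -m venv --without-pip "$VENV_DIR"

# The wrapper script a job's --command would point at:
source "$VENV_DIR/bin/activate"
python3 -c 'import sys; print(sys.prefix)'   # prints the venv path once activated
```

On Toolforge you would create the venv once in your tool home and point the job's --command at the wrapper script.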
You can specify the retry policy for failed jobs.
The default policy is to not restart failed jobs, but you can choose to have them retried up to five times before the scheduling engine gives up.
Use the --retry N option. Example:
tools.mytool@tools-sgebastion-11:~$ toolforge jobs run myjob --command ./myjob.sh --image somelang1.23 --retry 2
Note that the retry policy is ignored for continuous jobs, since they are always restarted in case of failure.
You can use envvars to pass secrets and other configuration variables to your jobs.
You can define a list of jobs in a YAML file and load them all at once using the toolforge jobs load command. Example:
tools.mytool@tools-sgebastion-11:~$ toolforge jobs load jobs.yaml
NOTE: loading jobs from a file will flush jobs with the same name if their definition differs.
You can use the --job <name> option to load only one job as defined in the YAML file. Example:
tools.mytool@tools-sgebastion-11:~$ toolforge jobs load jobs.yaml --job "everyminute"
Example YAML file:
# https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework
---
# a cronjob
- name: hourly
  command: ./myothercommand.sh -v
  image: bullseye
  no-filelog: true
  schedule: "@hourly"
  emails: onfailure
# a continuous job
- name: endlessjob
  image: python3.13
  command: python3 dumps-daemon.py --endless
  continuous: true
  emails: all
# wait for this one-off job before loading the next
- name: myjob
  image: bullseye
  command: ./mycommand.sh --argument1
  wait: true
  emails: onfinish
# another one-off job after the previous one finished running
- name: anotherjob
  image: bullseye
  command: ./mycommand.sh --argument1
  emails: none
# this job sets custom stdout/stderr log files
- name: normal-job-with-custom-logs
  image: bullseye
  command: ./mycommand.sh --argument1
  filelog-stdout: logs/stdout.log
  filelog-stderr: logs/stderr.log
# this job sets a custom retry policy
- name: normal-job-with-custom-retry-policy
  image: bullseye
  command: ./mycommand.sh --argument1
  retry: 2
# this job requests a higher memory limit
- name: normal-job-with-higher-memory-limit
  image: bullseye
  command: ./mycommand.sh --argument1
  mem: 500Mi
# this continuous job runs a healthcheck script
- name: job-with-healthcheck-script
  image: bullseye
  command: ./some-command.sh
  continuous: true
  health-check-script: ./some-healthcheck-script.sh
# this continuous job has multiple replicas configured
- name: job-with-3-replicas
  image: bullseye
  command: ./some-command.sh
  continuous: true
  replicas: 3
You can do the opposite operation, and get all the defined jobs in YAML format, perhaps for a later load. Examples:
tools.mytool@tools-bastion-12:~$ toolforge jobs dump
- command: ./some-script.sh
  continuous: true
  image: bookworm
  name: test
- command: ./some-script.sh
  continuous: true
  image: bookworm
  mem: 1G
  name: test2
tools.mytool@tools-bastion-12:~$ toolforge jobs dump --to-file myjobs.yaml
tools.mytool@tools-bastion-12:~$ toolforge jobs load myjobs.yaml
To run a job that expects to receive requests from other jobs (say, a backend job that expects requests from a frontend job), you need to configure the internal domain name of the job. This way the jobs making the requests won't need to know and keep track of the internal IP address of the target job. This is necessary because the internal IP addresses of jobs are ephemeral.
To configure the internal domain name, you only need to specify the target port with --port <portnumber>. Once that is done, your new job will be reachable at https://<jobname>:<port>.
Example:
tools.mytool@tools-sgebastion-11:~$ toolforge jobs run backend-continuous-job --command ./server.sh --image somelang1.23 --continuous --port 8080
The above job will now be reachable from other jobs by name, for example https://backend-continuous-job:8080.
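As a hedged illustration (the curl invocation and the /healthz path are assumptions for this sketch, not part of the framework documentation), another job of the same tool could then contact the backend like this:

```shell
# From any other job of the same tool: resolve the backend by job name
# instead of an ephemeral IP. /healthz is a placeholder for whatever
# endpoint your server actually serves.
curl https://backend-continuous-job:8080/healthz
```

This is a command fragment that only resolves inside the tool's Kubernetes namespace.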
Sometimes your continuous jobs can get stuck at the code level but still appear to be running when you run toolforge jobs list. Configuring a health check can help ensure that Toolforge detects issues like this and restarts your continuous job.
We currently support two types of health checks: script and http health checks.
To configure a script health check, specify the --health-check-script argument, the value of which should either be an inline string or an executable file. Example:
tools.mytool@tools-sgebastion-11:~$ toolforge jobs run continuous-job-with-script-health-check --command ./myendlesscommand.sh --image somelang1.23 --continuous --health-check-script ./health-check.sh
Make sure the script is executable (chmod u+x health-check.sh) before creating your job with a health check configured. To configure an http health check, specify the --health-check-http argument, the value of which should be an HTTP endpoint. You also need to configure the port for your job by providing the --port option.
tools.mytool@tools-sgebastion-11:~$ toolforge jobs run continuous-job-with-health-check --command ./myendlesscommand.sh --image somelang1.23 --continuous --health-check-http /healthz --port 8080
In order to properly work with health checks, your tool/job code needs to be aware of this health check. In particular:
* For script health checks, the job's main code loop includes some code to create a control file, for example /tmp/myjob-alive. You configure the health check to verify the existence of this file and to delete it if present, for example: --health-check-script "test -e /tmp/myjob-alive && rm /tmp/myjob-alive". Because the control file was deleted by the health check, if the job is alive it should create the file again in the next loop iteration. If it is not created, the health check will fail, indicating the job is not healthy, and Toolforge will therefore restart the job.
* For http health checks, your job's logic needs to configure the server that serves the endpoint passed to --health-check-http.
Checks happen in two different phases: startup and liveness.
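The control-file protocol described above can be sketched as a short script (a bounded loop stands in for the job's real endless loop; the file name is the illustrative one from the text):

```shell
ALIVE_FILE=/tmp/myjob-alive

# The job's main loop: a real continuous job would use `while true`;
# three iterations are enough to illustrate the protocol.
for _ in 1 2 3; do
    touch "$ALIVE_FILE"   # prove liveness on every iteration
    # ... one unit of real work would go here ...
done

# What Toolforge runs as --health-check-script:
#   test -e /tmp/myjob-alive && rm /tmp/myjob-alive
# It succeeds only if the job touched the file since the last check.
if test -e "$ALIVE_FILE" && rm "$ALIVE_FILE"; then
    STATUS=healthy
else
    STATUS=unhealthy
fi
echo "$STATUS"
```

If the loop stalls, the file is never recreated after the check deletes it, so the next check fails and Toolforge restarts the job.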
The Toolforge jobs framework creates one instance of a job by default. Sometimes there is a need to run multiple instances of the exact same thing, for example multiple runner processes.
To create a multi-replica job, you can use the --replicas option:
tools.mytool@tools-sgebastion-11:~$ toolforge jobs run backend-continuous-job --command ./server.sh --image somelang1.23 --continuous --replicas 2
You can get information about the jobs created for your tool using toolforge jobs list. Example:
tools.mytool@tools-sgebastion-11:~$ toolforge jobs list
Job name:       Job type:          Status:
--------------  -----------------  ---------------------------
myscheduledjob  schedule: @hourly  Last schedule time: 2021-06-30T10:26:00Z
alwaysrunning   continuous         Running
myjob           normal             Completed
Listing even more information at once is possible using --output long:
tools.mytool@tools-sgebastion-11:~$ toolforge jobs list --output long
Job name:       Command:                Job type:          Image:    File log:  Output log:  Error log:   Emails:   Resources:  Retry:  Status:
--------------  ----------------------  -----------------  --------  ---------  -----------  -----------  --------  ----------  ------  ---------
myscheduledjob  ./read-dumps.sh         schedule: @hourly  bullseye  no         /dev/null    /dev/null    none      default     no      Running
alwaysrunning   ./myendlesscommand.sh   continuous         bullseye  yes        test2.out    test2.err    none      default     no      Running
myjob           ./mycommand.sh --debug  normal             bullseye  yes        logs/mylog   logs/mylog   onfinish  default     2       Completed
You can also get the list of defined jobs in YAML format, using the dump operation. Examples:
tools.mytool@tools-sgebastion-10:~$ toolforge jobs list
Job name:    Job type:    Status:
-----------  -----------  ---------
myjob        continuous   Running
myjob2       continuous   Running
tools.mytool@tools-sgebastion-10:~$ toolforge jobs dump
- command: ./some-script.sh
  continuous: true
  image: bookworm
  name: myjob
- command: ./some-script.sh
  continuous: true
  image: bookworm
  mem: 1G
  name: myjob2
You can then save this dump YAML output to a file by either redirecting the output, or selecting the file directly with the -f or --to-file options. All the next examples are equivalent:
tools.mytool@tools-sgebastion-10:~$ toolforge jobs dump > jobs.yaml
tools.mytool@tools-sgebastion-10:~$ toolforge jobs dump -f jobs.yaml
tools.mytool@tools-sgebastion-10:~$ toolforge jobs dump --to-file jobs.yaml
You can use this YAML dump file later in a load operation.
You can delete your jobs in two ways:
* Delete a single job using the toolforge jobs delete command.
* Delete all jobs at once using the toolforge jobs flush command.
You can get information about a defined job using the toolforge jobs show command. Example:
tools.mytool@tools-sgebastion-11:~$ toolforge jobs show myscheduledjob
+---------------+---------------------------------------------------------------+
| Job name:     | myscheduledjob                                                |
+---------------+---------------------------------------------------------------+
| Command:      | ./read-dumps.sh myargument                                    |
+---------------+---------------------------------------------------------------+
| Job type:     | schedule: * * * * *                                           |
+---------------+---------------------------------------------------------------+
| Image:        | bullseye                                                      |
+---------------+---------------------------------------------------------------+
| File log:     | yes                                                           |
+---------------+---------------------------------------------------------------+
| Output log:   | /data/project/tool-name/myscheduledjob.out                    |
+---------------+---------------------------------------------------------------+
| Error log:    | /data/project/tool-name/myscheduledjob.err                    |
+---------------+---------------------------------------------------------------+
| Emails:       | none                                                          |
+---------------+---------------------------------------------------------------+
| Resources:    | mem: 10Mi, cpu: 100                                           |
+---------------+---------------------------------------------------------------+
| Replicas:     | 1                                                             |
+---------------+---------------------------------------------------------------+
| Mounts:       | all                                                           |
+---------------+---------------------------------------------------------------+
| Retry:        | no                                                            |
+---------------+---------------------------------------------------------------+
| Health check: | none                                                          |
+---------------+---------------------------------------------------------------+
| Status:       | Last schedule time: 2021-06-30T10:26:00Z                      |
+---------------+---------------------------------------------------------------+
| Hints:        | Last run at 2021-06-30T10:26:08Z. Pod in 'Pending' phase.     |
|               | State 'waiting' for reason 'ContainerCreating'.               |
+---------------+---------------------------------------------------------------+
This should include information about the job status and some hints (in case of failure, etc).
You can restart cronjobs or continuous jobs.
Use toolforge jobs restart <jobname>. Example:
tools.mytool@tools-sgebastion-11:~$ toolforge jobs restart myjob
You can use this functionality to reset the internal state of stuck jobs or jobs in a failed state. The internal behavior is similar to removing the job and defining it again.
Trying to restart a non-existent job will do nothing.
There are currently two possibilities for collecting logs from jobs:
* Log files in the tool home directory (described below).
* The toolforge jobs logs command, available while a job is running and for a short period after the job has finished.
If a job has file logs disabled (it uses a build service image or --no-filelog), the Toolforge Kubernetes infrastructure will internally store the output. To view these logs, use toolforge jobs logs:
tools.mytool@tools-sgebastion-11:~$ toolforge jobs logs myjob
This command also takes some flags:
* -f to follow logs in real time
* -l [number] to only see a specific number of the newest log lines
Jobs log stdout/stderr to files in your tool home directory.
For a job myjob, you will find:
* a myjob.out file, containing stdout generated by your job.
* a myjob.err file, containing stderr generated by your job.
Example:
tools.mytool@tools-sgebastion-11:~$ toolforge jobs run myjob --command ./mycommand.sh --image bullseye
tools.mytool@tools-sgebastion-11:~$ ls myjob*
myjob.out  myjob.err
Subsequent same-name job runs will append to the same files.
Log generation can be disabled with the --no-filelog parameter when creating a new job, for example:
tools.mytool@tools-sgebastion-11:~$ toolforge jobs run myjob --command ./mycommand.sh --image bullseye --no-filelog
You can control where you store your logs. This allows for things like merging both streams into a single file, discarding one of the streams, or storing logs in a custom directory.
To do that, make use of the following options when running a new job:
* -o path/to/file.log or --filelog-stdout path/to/file.log to select the stdout log file
* -e path/to/file.log or --filelog-stderr path/to/file.log to select the stderr log file
Example, running a job that merges both log streams into a single log file:
tools.mytool@tools-sgebastion-11:~$ toolforge jobs run myjob --command ./mycommand.sh --image bullseye --filelog-stdout myjob.log --filelog-stderr myjob.log
Example, running a job that uses the default `jobname`.out but ignores stderr:
tools.mytool@tools-sgebastion-11:~$ toolforge jobs run myjob --command ./mycommand.sh --image bullseye --filelog-stderr /dev/null
Example, running a job that logs both streams separately in a custom directory:
tools.mytool@tools-sgebastion-11:~$ toolforge jobs run myjob --command ./mycommand.sh --image bullseye --filelog-stdout mylogs/myjob.out.log --filelog-stderr mylogs/myjob.err.log
Custom directories must be created by hand before the job runs. Selecting an invalid directory here will likely result in the job failing with exit code 2.
Users should take care that log files do not grow too large.
The mariadb image includes the logrotate program, which can be used to control the sizes of log files using the Toolforge jobs framework.
If you have a continuous job, you will want to use copytruncate mode for log rotation. To set it up, create a configuration file logrotate-myjob.conf similar to this:
tools.mytool@tools-sgebastion-11:~$ nano logrotate-myjob.conf
"./logs/myjob.log" {
    daily
    rotate 6
    copytruncate
    dateext
}
This configuration will rotate your log files daily, and keep 6 days of old logs in addition to the log for the current day. The dateext option renames rotated log files by appending a date to their filenames, allowing for better organization and differentiation of log files based on the date of rotation.
Then you can start automatic log rotation with:
tools.mytool@tools-sgebastion-11:~$ toolforge jobs run logrotate-myjob --command 'logrotate -v $TOOL_DATA_DIR/logrotate-myjob.conf --state $TOOL_DATA_DIR/logrotate-myjob.state' --image mariadb --schedule "@daily"
For rotating all your logs, you can use globs like:
tools.mytool@tools-sgebastion-11:~$ cat > logrotate-all.conf
"./*.err" "./*.out" {
    daily
    rotate 6
    copytruncate
    dateext
    compress
    notifempty
}
tools.mytool@tools-sgebastion-11:~$ toolforge jobs run logrotate-all --command 'logrotate -v $TOOL_DATA_DIR/logrotate-all.conf --state $TOOL_DATA_DIR/logrotate-all.state' --image mariadb --schedule "@daily"
Providing more modern approaches and facilities for logs management, metrics, etc. is in the current roadmap for the WMCS team. See Phabricator T127367 for example.
Each tool account has a limited quota available. The same quota is used for jobs and other things potentially running on Kubernetes, like webservices.
To check your quota, run:
tools.mytool@tools-sgebastion-11:~$ toolforge jobs quota
Running jobs                                  Used    Limit
--------------------------------------------  ------  -------
Total running jobs at once (Kubernetes pods)  0       10
Running one-off and cron jobs                 0       15
CPU                                           0       2
Memory                                        0       8Gi

Per-job limits    Limit
----------------  -------
CPU               1
Memory            4Gi

Job definitions                           Used    Limit
----------------------------------------  ------  -------
Cron jobs                                 0       50
Continuous jobs (including web services)  0       3
As of this writing, new jobs get 512Mi of memory and 0.5 CPU by default.
You can run jobs with additional CPU and memory using the --mem MEM and --cpu CPU parameters. Example:
tools.mytool@tools-sgebastion-11:~$ toolforge jobs run myjob --command "./heavycommand.sh" --image bullseye --mem 1Gi --cpu 2
Requesting more memory or CPU will fail if the tool quota is exceeded.
You can find details on the underlying Kubernetes quotas here.
It is possible to request a quota increase if you can demonstrate your tool's need for more resources than the default namespace quota allows. Instructions and a template link for creating a quota request can be found at Toolforge (Quota requests) in Phabricator.
Please read all the instructions there before submitting your request.
Note for Toolforge admins: there are docs on how to do quota upgrades.
You can select to receive email notifications from your job activity, by using the --emails EMAILS option when creating a job.
The available choices are:
* none: don't get any email notification. The default behavior.
* onfailure: receive email notifications in case of a failure event.
* onfinish: receive email notifications when the job finishes (both successfully and on failure).
* all: receive all possible notifications.
Example:
tools.mytool@tools-sgebastion-11:~$ toolforge jobs run myjob --command ./mycommand.sh --image bullseye --emails onfinish
The email will be sent to tools.mytool@toolforge.org, which is an email alias that by default redirects to all tool maintainers associated with that particular tool account.
List all available jobs-framework commands using the toolforge jobs -h command:
tools.mytool@tools-sgebastion-11:~$ toolforge jobs -h
usage: toolforge jobs [-h] {images,run,show,logs,list,delete,flush,load,restart,quota,dump} ...

Toolforge Jobs Framework, command line interface

positional arguments:
  {images,run,show,logs,list,delete,flush,load,restart,quota,dump}
                        possible operations (pass -h to know usage of each)
    images              list information on available container image types for Toolforge jobs
    run                 run a new job of your own in Toolforge
    show                show details of a job of your own in Toolforge
    logs                show output from a running job
    list                list all running jobs of your own in Toolforge
    delete              delete a running job of your own in Toolforge
    flush               delete all running jobs of your own in Toolforge
    load                flush all jobs and load a YAML file with job definitions and run them
    restart             restarts a running job
    quota               display quota information
    dump                dump all defined jobs in YAML format, suitable for a later `load` operation

options:
  -h, --help            show this help message and exit
List all available run command arguments using the toolforge jobs run -h command:
tools.mytool@tools-sgebastion-11:~$ toolforge jobs run -h
usage: toolforge jobs run [-h] --command COMMAND --image IMAGE [--no-filelog | --filelog]
                          [-o FILELOG_STDOUT] [-e FILELOG_STDERR] [--retry {0,1,2,3,4,5}]
                          [--mem MEM] [--cpu CPU] [--emails {none,all,onfinish,onfailure}]
                          [--mount {all,none}] [--timeout TIMEOUT]
                          [--schedule SCHEDULE | --continuous | --wait [WAIT]]
                          [--health-check-script HEALTH_CHECK_SCRIPT | --health-check-http HEALTH_CHECK_HTTP]
                          [-p PORT] [--replicas REPLICAS]
                          name

positional arguments:
  name                  new job name

options:
  -h, --help            show this help message and exit
  --command COMMAND     full path of command to run in this job
  --image IMAGE         image shortname (check them with `images`)
  --no-filelog          disable redirecting job output to files in the home directory
  --filelog             explicitly enable file logs on jobs using a build service created image
  -o FILELOG_STDOUT, --filelog-stdout FILELOG_STDOUT
                        location to store stdout logs for this job
  -e FILELOG_STDERR, --filelog-stderr FILELOG_STDERR
                        location to store stderr logs for this job
  --retry {0,1,2,3,4,5}
                        specify the retry policy of failed jobs
  --mem MEM             specify additional memory limit required for this job
  --cpu CPU             specify additional CPU limit required for this job
  --emails {none,all,onfinish,onfailure}
                        specify if the system should email notifications about this job (default: 'none')
  --mount {all,none}    specify which shared storage (NFS) directories to mount to this job
                        (default: 'none' on build service images, 'all' otherwise)
  --timeout TIMEOUT     timeout in seconds for a scheduled job before it's stopped
  --schedule SCHEDULE   run a job with a cron-like schedule (example '1 * * * *')
  --continuous          run a continuous job
  --wait [WAIT]         wait for a one-off job to complete, optionally specify a value to
                        override the default timeout of 600s
  --health-check-script HEALTH_CHECK_SCRIPT
                        specify a health check command to run on the job if any
  --health-check-http HEALTH_CHECK_HTTP
                        specify a health check endpoint to query on the job if any
  -p PORT, --port PORT  specify the port to expose for this job, only valid for continuous jobs
  --replicas REPLICAS   specify the number of job replicas to be used, only valid for continuous jobs
The following tools have been built by the Toolforge admin team to help others see job status:
Support and administration of the WMCS resources is provided by the Wikimedia Foundation Cloud Services team and Wikimedia movement volunteers. Please reach out with questions and join the conversation:
Use a subproject of the #Cloud-Services Phabricator project to track confirmed bug reports and feature requests about the Cloud Services infrastructure itself
Read the Cloud Services Blog (for the broader Wikimedia movement, see the Wikimedia Technical Blog)