Jobs retries and checkpoints best practices

Individual job tasks or even job executions can fail for a variety of reasons. This page contains best practices to handle these failures, centered around task restarts and job checkpointing.

Use task retries

Individual job tasks can fail for a variety of reasons, including issues with application dependencies, quotas, or even internal system events. Often such issues are transient and the task will succeed after a retry.

By default, each task will automatically retry up to 3 times. This helps ensure a job will run to completion even if it encounters transient task failures. You can also customize the maximum number of retries. However, if you do change the default, you should specify at least one retry.

Plan for job task restarts

Make your jobs idempotent, so that a task restart does not result in corrupt or duplicate output. That is, write repeatable logic that has the same behavior for a given set of inputs no matter how many times it is repeated or when it is executed.

Write your output to a different location than the input data, leaving input data intact. This way, if the job runs again, the job can repeat the process from the beginning and get the same result.

Avoid duplicating output data by reusing the same unique identifier or checking if the output already exists. Duplicate data represents collection-level data corruption.
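The idea of reusing a unique identifier and checking for existing output can be sketched as follows. This is a minimal illustration, not a prescribed implementation: it uses a local directory as a stand-in for a real output location, and the names `output_path`, `process_once`, and the `value` field are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def output_path(output_dir: Path, record: dict) -> Path:
    """Derive a deterministic output name from the input record."""
    # The same input always hashes to the same name, so a retried task
    # finds (or overwrites) its earlier output instead of duplicating it.
    digest = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
    return output_dir / f"{digest}.json"


def process_once(output_dir: Path, record: dict) -> bool:
    """Write the result for `record` unless it already exists.

    Returns True if work was done, False if the output was already present.
    """
    target = output_path(output_dir, record)
    if target.exists():
        return False
    # Placeholder work; the input record itself is never modified.
    result = {"input": record, "doubled": record["value"] * 2}
    # Write to a temporary name, then rename, so a crash mid-write never
    # leaves a partial file under the final name.
    tmp = target.with_suffix(".tmp")
    tmp.write_text(json.dumps(result))
    tmp.replace(target)
    return True
```

Calling `process_once` twice with the same record produces one output file; the second call detects the existing output and does nothing.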

Use checkpointing

Where possible, checkpoint your jobs so that if a task restarts after a failure, it can pick up where it left off, instead of restarting work at the beginning. Doing this will speed up your jobs as well as minimize unnecessary costs.

Periodically write partial results and an indication of progress made to a persistent storage location such as Cloud Storage or a database. When your task starts, look for partial results. If partial results are found, resume processing where they left off.

If your job does not lend itself to checkpointing, consider breaking it up into smaller chunks and running a larger number of tasks.
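A generic checkpoint loop might look like the sketch below. It assumes a local JSON file as a stand-in for persistent storage, and the helper names (`load_checkpoint`, `save_checkpoint`, `run`) and the `fail_at` parameter are hypothetical, included only to simulate a task failure.

```python
import json
import os
from pathlib import Path
from typing import Optional


def load_checkpoint(path: Path) -> dict:
    """Return saved state, or a fresh state if no checkpoint exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {"next_index": 0, "total": 0}


def save_checkpoint(path: Path, state: dict) -> None:
    """Persist state atomically: write a temp file, then rename it into place."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)


def run(items: list, checkpoint: Path, every: int = 100,
        fail_at: Optional[int] = None) -> int:
    """Sum `items`, saving progress after every `every` items.

    `fail_at` raises at that index to simulate a crash (demonstration only).
    """
    state = load_checkpoint(checkpoint)
    # Resume from the first item not covered by the last checkpoint.
    for i in range(state["next_index"], len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated task failure")
        state["total"] += items[i]
        state["next_index"] = i + 1
        if state["next_index"] % every == 0:
            save_checkpoint(checkpoint, state)
    save_checkpoint(checkpoint, state)
    return state["total"]
```

The partial result and the progress marker are saved together in one file, so a restarted task never sees a total that disagrees with its position. Work done after the last checkpoint is repeated on restart, which is why the checkpoint interval should match how much lost work is acceptable.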

Checkpointing example 1: calculating Pi

If you have a job that executes a recursive algorithm, such as calculating Pi to many decimal places, and uses parallelism set to a value of 1:

Write your progress every 10 minutes, or at whatever interval your lost work tolerance allows, to a pi-progress.txt Cloud Storage object.
When a task starts, query the pi-progress.txt object and load the value as a starting place. Use that value as the initial input to your function.
Write your final result to Cloud Storage as an object named pi-complete.txt to avoid duplication via parallel or repeated execution, or pi-complete-DATE.txt to differentiate by completion date.
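The steps above can be sketched as follows. This is an illustration under stated assumptions: a local directory (`store`) stands in for a Cloud Storage bucket, the Leibniz series is swapped in as a simple way to approximate Pi, and progress is saved every fixed number of terms rather than every 10 minutes. The function name `compute_pi` and its parameters are hypothetical.

```python
import json
from pathlib import Path

PROGRESS = "pi-progress.txt"
COMPLETE = "pi-complete.txt"


def compute_pi(store: Path, total_terms: int, checkpoint_every: int = 10_000) -> float:
    """Approximate Pi with the Leibniz series, resuming from saved progress."""
    complete = store / COMPLETE
    if complete.exists():
        # An earlier execution already finished; reuse its result.
        return float(complete.read_text())

    progress = store / PROGRESS
    if progress.exists():
        state = json.loads(progress.read_text())
    else:
        state = {"k": 0, "partial": 0.0}

    # k is the next term to add; partial is the sum of terms 0..k-1.
    k, partial = state["k"], state["partial"]
    while k < total_terms:
        partial += (-1) ** k / (2 * k + 1)
        k += 1
        if k % checkpoint_every == 0:
            # Write to a temporary name, then rename into place.
            tmp = store / (PROGRESS + ".tmp")
            tmp.write_text(json.dumps({"k": k, "partial": partial}))
            tmp.replace(progress)

    pi = 4 * partial
    complete.write_text(repr(pi))
    return pi
```

Because the saved state holds both the term index and the partial sum, a restarted task continues the series from the last checkpoint instead of from term 0.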

Checkpointing example 2: processing 10,000 records from Cloud SQL

If you have a job processing 10,000 records in a relational database such as Cloud SQL:

Retrieve records to be processed with a SQL query such as SELECT * FROM example_table LIMIT 10000
Write out updated records in batches of 100 so significant processing work is not lost on interruption.
When records are written, note which ones have been processed. You might add a boolean column processed to the table which is set to 1 only if processing is confirmed.
When a task starts, the query used to retrieve items for processing should add the condition processed = 0.
In addition to clean retries, this technique also supports breaking up work into smaller tasks, such as by modifying your query to select 100 records at a time: LIMIT 100 OFFSET $CLOUD_RUN_TASK_INDEX*100, and running 100 tasks to process all 10,000 records. CLOUD_RUN_TASK_INDEX is a built-in environment variable present inside the container running Cloud Run jobs.

Using all these pieces together, the final query might look like this: SELECT * FROM example_table WHERE processed = 0 LIMIT 100 OFFSET $CLOUD_RUN_TASK_INDEX*100
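The following sketch shows one way these pieces could fit together in code. It rests on several assumptions: an in-memory SQLite database stands in for Cloud SQL, the `id` and `payload` columns are hypothetical, and the "processing" is a placeholder that upper-cases a string. It also swaps in a different slicing technique: each task takes a fixed range of `id` values rather than LIMIT and OFFSET, because the set of rows matching processed = 0 shrinks as rows are marked, which shifts offsets when a task is retried.

```python
import os
import sqlite3

BATCH_SIZE = 100  # rows committed per transaction


def task_index_from_env() -> int:
    """Read the task index that Cloud Run sets inside the job container."""
    return int(os.environ.get("CLOUD_RUN_TASK_INDEX", "0"))


def setup_demo(conn: sqlite3.Connection, rows: int) -> None:
    """Create a demo table; `id` and `payload` are hypothetical columns."""
    conn.execute(
        "CREATE TABLE example_table ("
        "id INTEGER PRIMARY KEY, payload TEXT, "
        "processed INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO example_table (id, payload) VALUES (?, ?)",
        [(i, f"row-{i}") for i in range(rows)],
    )
    conn.commit()


def run_task(conn: sqlite3.Connection, task_index: int, slice_size: int = 100) -> int:
    """Process this task's unprocessed rows; return how many were handled."""
    low = task_index * slice_size
    rows = conn.execute(
        "SELECT id, payload FROM example_table "
        "WHERE processed = 0 AND id >= ? AND id < ? ORDER BY id",
        (low, low + slice_size),
    ).fetchall()
    for start in range(0, len(rows), BATCH_SIZE):
        for row_id, payload in rows[start:start + BATCH_SIZE]:
            # The update and the processed flag are written together, so a
            # row is marked done only if its result is stored.
            conn.execute(
                "UPDATE example_table SET payload = ?, processed = 1 WHERE id = ?",
                (payload.upper(), row_id),
            )
        conn.commit()  # one commit per batch limits work lost on interruption
    return len(rows)
```

A retried task runs the same query, finds only the rows in its range that are still unprocessed, and skips the rest.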

What's next

To create a Cloud Run job, see Create jobs.
To execute a job, see Execute jobs.
To execute a job on a schedule, see Execute jobs on a schedule.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2026-02-18 UTC.
