Refresh web pages using automatic and manual refresh

Key Term: The term refresh can be used interchangeably with the term recrawl in the context of a website. Refreshing or recrawling fetches the most recent version of a page and indexes it. Reindexing indexes documents that have already been crawled.

If your data store uses basic website search, the freshness of your store's index mirrors the freshness that's available in Google Search.

If advanced website indexing is enabled in your data store, the web pages in your data store are refreshed in the following ways:

  • Automatic refresh
  • Manual refresh
  • Sitemap-based refresh

This page describes automatic and manual refresh. To understand and implement sitemap-based refresh, see Index and refresh according to sitemap.

Before you begin

If your website uses a robots.txt file, update it. For more information, see how to prepare your website's robots.txt file.

Automatic refresh

Vertex AI Search performs automatic refresh as follows:

  • After you create a data store, it generates an initial index for the included pages.
  • After the initial indexing, it indexes any newly discovered pages and recrawls existing pages on a best-effort basis.
  • It regularly refreshes data stores that encounter a query rate of at least 50 queries per 30 days.
Note: The Automatic refresh feature is a discovery-based process for finding new pages that are not listed in a sitemap. When using sitemaps for precise indexing control, we recommend disabling this feature. Disabling Automatic refresh does not stop the separate refresh process based on sitemap modifications. For more information, see Index and refresh web pages using sitemaps.

Manual refresh

If you want to refresh specific web pages in a data store with Advanced website indexing turned on, you can call the recrawlUris method. Use the uris field to specify each web page that you want to crawl. The recrawlUris method is a long-running operation that runs until your specified web pages are crawled or until it times out after 24 hours, whichever comes first. If the recrawlUris method times out, you can call the method again, specifying the web pages that remain to be crawled. You can poll the operations.get method to monitor the status of your recrawl operation.

Note: The recrawlUris method recognizes literal URIs, not URI patterns. Any asterisk (*) in your URI is treated as a regular character. This is different from specifying the URLs to index when creating a website data store. When creating a data store, you can specify individual web pages or use wildcards to specify an entire website or part of a website, for example, www.mysite.com/*. By contrast, recrawlUris assumes that www.mysite.com/* is a single page. For more information about creating a website data store, see Website URLs. A short illustration follows this note.
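To make the note concrete, here is a minimal sketch of a request body with literal page URLs. The URLs and the request.json file name are hypothetical:

    # Hypothetical request body for recrawlUris: every entry must be a
    # literal page URL. A wildcard such as "https://www.mysite.com/*",
    # which is valid when creating a data store, would be treated here
    # as a single (almost certainly nonexistent) page.
    cat > request.json <<'EOF'
    {
      "uris": [
        "https://www.mysite.com/products/widget",
        "https://www.mysite.com/products/gadget"
      ]
    }
    EOF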

Limits on recrawling

There are limits to how often you can crawl web pages and how many web pages you can crawl at a time:

  • Calls per day. The maximum number of calls to the recrawlUris method allowed is 20 per day, per project.
  • Web pages per call. The maximum number of uris values that you can specify with a call to the recrawlUris method is 10,000. If you need to recrawl more pages, split them across multiple calls, as sketched after this list.
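The following is a minimal batching sketch under stated assumptions, not an official tool: it assumes a uris.txt file with one URL per line, the standard split, sed, and paste utilities, and placeholder PROJECT_ID and DATA_STORE_ID values.

    # Split the URI list into batches of at most 10,000 (the per-call limit)
    # and submit each batch, stopping at 20 calls (the per-day, per-project limit).
    split -l 10000 uris.txt batch-

    count=0
    for f in batch-*; do
      if [ "$count" -ge 20 ]; then
        echo "Daily call limit reached; submit the remaining batches tomorrow." >&2
        break
      fi
      # Quote each URI and join them with commas to build the JSON array.
      uris=$(sed 's/.*/"&"/' "$f" | paste -sd, -)
      curl -X POST \
        -H "Authorization: Bearer $(gcloud auth print-access-token)" \
        -H "Content-Type: application/json" \
        -H "X-Goog-User-Project: PROJECT_ID" \
        "https://discoveryengine.googleapis.com/v1alpha/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/siteSearchEngine:recrawlUris" \
        -d "{\"uris\": [$uris]}"
      count=$((count + 1))
    done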

Recrawl the web pages in your data store

You can manually crawl specific web pages in a data store that has Advanced website indexing turned on.

REST

To use the command line to crawl specific web pages in your data store, follow these steps:

  1. Find your data store ID. If you already have your data store ID, skip to the next step.

    1. In the Google Cloud console, go to the AI Applications page and, in the navigation menu, click Data Stores.

      Go to the Data Stores page

    2. Click the name of your data store.

    3. On the Data page for your data store, get the data store ID.

  2. Call the recrawlUris method, using the uris field to specify each web page that you want to crawl. Each uri represents a single page even if it contains asterisks (*). Wildcard patterns are not supported.

    curl -X POST \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json" \
      -H "X-Goog-User-Project: PROJECT_ID" \
      "https://discoveryengine.googleapis.com/v1alpha/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/siteSearchEngine:recrawlUris" \
      -d '{
        "uris": [URIS]
      }'

    Replace the following:

    • PROJECT_ID: the ID of your Google Cloud project.
    • DATA_STORE_ID: the ID of the Vertex AI Search data store.
    • URIS: the list of web pages that you want to crawl, for example, "https://example.com/page-1", "https://example.com/page-2", "https://example.com/page-3".

    The output is similar to the following:

    {"name":"projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/operations/recrawl-uris-0123456789012345678","metadata":{"@type":"type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisMetadata"}}
  3. Save the name value as input for the operations.get operation when monitoring the status of your recrawl operation. If you script the call, you can capture this value directly, as sketched below.
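This is a sketch that assumes the jq utility is installed and that request.json holds your {"uris": [...]} request body:

    # Submit the recrawl request and save the operation name for polling.
    OPERATION_NAME=$(curl -s -X POST \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json" \
      -H "X-Goog-User-Project: PROJECT_ID" \
      "https://discoveryengine.googleapis.com/v1alpha/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/siteSearchEngine:recrawlUris" \
      -d @request.json | jq -r '.name')
    echo "${OPERATION_NAME}"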

Monitor the status of your recrawl operation

The recrawlUris method, which you use to crawl web pages in a data store, is a long-running operation that runs until your specified web pages are crawled or until it times out after 24 hours, whichever comes first. You can monitor the status of this long-running operation by polling the operations.get method, specifying the name value returned by the recrawlUris method. Continue polling until the response indicates either that all of your web pages are crawled, or that the operation timed out before all of your web pages were crawled. If recrawlUris times out, you can call it again, specifying the web pages that were not crawled.
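As a sketch of that polling pattern, the loop below calls operations.get until done is true. It assumes the jq utility, an OPERATION_NAME captured from the recrawlUris response, and a placeholder PROJECT_ID; the five-minute sleep is an arbitrary choice that matches how often the operation metadata updates. The step-by-step REST procedure follows.

    # Poll operations.get until the long-running operation reports done.
    while true; do
      response=$(curl -s -X GET \
        -H "Authorization: Bearer $(gcloud auth print-access-token)" \
        -H "Content-Type: application/json" \
        -H "X-Goog-User-Project: PROJECT_ID" \
        "https://discoveryengine.googleapis.com/v1alpha/${OPERATION_NAME}")
      # Print a one-line progress summary from the operation metadata.
      echo "$response" | jq '{success: .metadata.successCount, pending: .metadata.pendingCount, done: .done}'
      if [ "$(echo "$response" | jq -r '.done // false')" = "true" ]; then
        break
      fi
      sleep 300
    done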

REST

To use the command line to monitor the status of a recrawl operation, follow these steps:

  1. Find your data store ID. If you already have your data store ID, skip to the next step.

    1. In the Google Cloud console, go to the AI Applications page and, in the navigation menu, click Data Stores.

      Go to the Data Stores page

    2. Click the name of your data store.

    3. On the Data page for your data store, get the data store ID.

  2. Poll the operations.get method.

    curl -X GET \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json" \
      -H "X-Goog-User-Project: PROJECT_ID" \
      "https://discoveryengine.googleapis.com/v1alpha/OPERATION_NAME"

    Replace the following:

    • PROJECT_ID: the ID of your Google Cloud project.
    • OPERATION_NAME: the name value that was returned when you called the recrawlUris method.

  3. Evaluate each response.

    • If a response indicates that there are pending URIs and the recrawl operation is not done, your web pages are still being crawled. Continue polling.

      Example

      {"name":"projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/operations/recrawl-uris-0123456789012345678","metadata":{"@type":"type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisMetadata","createTime":"2023-09-05T22:07:28.690950Z","updateTime":"2023-09-05T22:22:10.978843Z","validUrisCount":4000,"successCount":2215,"pendingCount":1785},"done":false,"response":{"@type":"type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisResponse",}}

      The response fields can be described as follows:

      • createTime: indicates the time that the long-running operation started.
      • updateTime: indicates the last time that the long-running operation metadata was updated. The metadata updates every five minutes until the operation is done.
      • validUrisCount: indicates that you specified 4,000 valid URIs in your call to the recrawlUris method.
      • successCount: indicates that 2,215 URIs were successfully crawled.
      • pendingCount: indicates that 1,785 URIs have not yet been crawled.
      • done: a value of false indicates that the recrawl operation is still in progress.

    • If a response indicates that there are no pending URIs (no pendingCount field is returned) and the recrawl operation is done, then your web pages are crawled. Stop polling; you can quit this procedure.

      Example

      {"name":"projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/operations/recrawl-uris-0123456789012345678","metadata":{"@type":"type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisMetadata","createTime":"2023-09-05T22:07:28.690950Z","updateTime":"2023-09-05T22:37:11.367998Z","validUrisCount":4000,"successCount":4000},"done":true,"response":{"@type":"type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisResponse"}}

      The response fields can be described as follows:

      • createTime: indicates the time that the long-running operation started.
      • updateTime: indicates the last time that the long-running operation metadata was updated. The metadata updates every five minutes until the operation is done.
      • validUrisCount: indicates that you specified 4,000 valid URIs in your call to the recrawlUris method.
      • successCount: indicates that 4,000 URIs were successfully crawled.
      • done: a value of true indicates that the recrawl operation is done.
  4. If a response indicates that there are pending URIs and the recrawl operation is done, then the recrawl operation timed out (after 24 hours) before all of your web pages were crawled. Start again at Recrawl the web pages in your data store. Use the failedUris values in the operations.get response for the values in the uris field in your new call to the recrawlUris method. A scripted version of this retry appears after the field descriptions below.

    Example

    {"name":"projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/operations/recrawl-uris-8765432109876543210","metadata":{"@type":"type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisMetadata","createTime":"2023-09-05T22:07:28.690950Z","updateTime":"2023-09-06T22:09:10.613751Z","validUrisCount":10000,"successCount":9988,"pendingCount":12},"done":true,"response":{"@type":"type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisResponse","failedUris":["https://example.com/page-9989","https://example.com/page-9990","https://example.com/page-9991","https://example.com/page-9992","https://example.com/page-9993","https://example.com/page-9994","https://example.com/page-9995","https://example.com/page-9996","https://example.com/page-9997","https://example.com/page-9998","https://example.com/page-9999","https://example.com/page-10000"],"failureSamples":[{"uri":"https://example.com/page-9989","failureReasons":[{"corpusType":"DESKTOP","errorMessage":"Page was crawled but was not indexed by UCS within 24 hours."},{"corpusType":"MOBILE","errorMessage":"Page was crawled but was not indexed by UCS within 24 hours."}]},{"uri":"https://example.com/page-9990","failureReasons":[{"corpusType":"DESKTOP","errorMessage":"Page was crawled but was not indexed by UCS within 24 hours."},{"corpusType":"MOBILE","errorMessage":"Page was crawled but was not indexed by UCS within 24 hours."}]},{"uri":"https://example.com/page-9991","failureReasons":[{"corpusType":"DESKTOP","errorMessage":"Page was crawled but was not indexed by UCS within 24 hours."},{"corpusType":"MOBILE","errorMessage":"Page was crawled but was not indexed by UCS within 24 hours."}]},{"uri":"https://example.com/page-9992","failureReasons":[{"corpusType":"DESKTOP","errorMessage":"Page was crawled but was not indexed by UCS within 24 hours."},{"corpusType":"MOBILE","errorMessage":"Page was crawled but was not indexed by UCS within 24 hours."}]},{"uri":"https://example.com/page-9993","failureReasons":[{"corpusType":"DESKTOP","errorMessage":"Page was crawled but was not indexed by UCS within 24 hours."},{"corpusType":"MOBILE","errorMessage":"Page was crawled but was not indexed by UCS within 24 hours."}]},{"uri":"https://example.com/page-9994","failureReasons":[{"corpusType":"DESKTOP","errorMessage":"Page was crawled but was not indexed by UCS within 24 hours."},{"corpusType":"MOBILE","errorMessage":"Page was crawled but was not indexed by UCS within 24 hours."}]},{"uri":"https://example.com/page-9995","failureReasons":[{"corpusType":"DESKTOP","errorMessage":"Page was crawled but was not indexed by UCS within 24 hours."},{"corpusType":"MOBILE","errorMessage":"Page was crawled but was not indexed by UCS within 24 hours."}]},{"uri":"https://example.com/page-9996","failureReasons":[{"corpusType":"DESKTOP","errorMessage":"Page was crawled but was not indexed by UCS within 24 hours."},{"corpusType":"MOBILE","errorMessage":"Page was crawled but was not indexed by UCS within 24 hours."}]},{"uri":"https://example.com/page-9997","failureReasons":[{"corpusType":"DESKTOP","errorMessage":"Page was crawled but was not indexed by UCS within 24 hours."},{"corpusType":"MOBILE","errorMessage":"Page was crawled but was not indexed by UCS within 24 hours."}]},{"uri":"https://example.com/page-9998","failureReasons":[{"corpusType":"DESKTOP","errorMessage":"Page was crawled but was not indexed by UCS within 24 hours."},{"corpusType":"MOBILE","errorMessage":"Page was crawled but was not indexed by UCS within 
24 hours."}]}]}}

    The response fields can be described as follows:

    • createTime. The time that the long-running operation started.
    • updateTime. The last time that the long-running operation metadata was updated. The metadata updates every five minutes until the operation is done.
    • validUrisCount. Indicates that you specified 10,000 valid URIs in your call to the recrawlUris method.
    • successCount. Indicates that 9,988 URIs were successfully crawled.
    • pendingCount. Indicates that 12 URIs have not yet been crawled.
    • done. A value of true indicates that the recrawl operation is done.
    • failedUris. A list of URIs that were not crawled before the recrawl operation timed out.
    • failureSamples. Information about URIs that failed to crawl. At most ten failureInfo values are returned in this array, even if more than ten URIs failed to crawl.
    • errorMessage. The reason that a URI failed to crawl, by corpusType. For more information, see Error messages.
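One way to script the retry described in step 4: assuming that the final operations.get response is saved as operation.json and that the jq utility is installed, you can extract the failedUris values and resubmit them (PROJECT_ID and DATA_STORE_ID are placeholders):

    # Build a new request body from the URIs that failed to crawl.
    jq '{uris: .response.failedUris}' operation.json > retry.json

    # Resubmit only the failed URIs in a new recrawlUris call.
    curl -X POST \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json" \
      -H "X-Goog-User-Project: PROJECT_ID" \
      "https://discoveryengine.googleapis.com/v1alpha/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/siteSearchEngine:recrawlUris" \
      -d @retry.json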

Timely refresh

Google recommends that you perform manual refresh on your new and updated pages to ensure that you have the latest index.

Error messages

If the recrawl operation times out while you are monitoring its status by polling the operations.get method, operations.get returns error messages for the web pages that were not crawled. The following table lists the error messages, whether the error is transient (a temporary error that resolves itself), and the actions that you can take before retrying the recrawlUris method. You can retry all transient errors immediately. Non-transient errors can be retried after you implement the listed remedy.

| Error message | Is it a transient error? | Action before retrying recrawl |
| --- | --- | --- |
| Page was crawled but was not indexed by Vertex AI Search within 24 hours | Yes | Use the failedUris values in the operations.get response for the values in the uris field when you call the recrawlUris method. |
| Crawling was blocked by the site's robots.txt | No | Unblock the URI in your website's robots.txt file, ensure that the Googlebot user agent is permitted to crawl the website, and retry recrawl. For more information, see How to write and submit a robots.txt file. If you cannot access the robots.txt file, contact the domain owner. |
| Page is unreachable | No | Check the URI that you specified when you call the recrawlUris method. Ensure that you provide the literal URI and not a URI pattern. |
| Crawling timed out | Yes | Use the failedUris values in the operations.get response for the values in the uris field when you call the recrawlUris method. |
| Page was rejected by Google crawler | Yes | Use the failedUris values in the operations.get response for the values in the uris field when you call the recrawlUris method. |
| URL could not be followed by Google crawler | No | If there are multiple redirects, use the URI from the last redirect and retry. |
| Page was not found (404) | No | Check the URI that you specified when you call the recrawlUris method. Ensure that you provide the literal URI and not a URI pattern. Any page that responds with a 4xx error code is removed from the index. |
| Page requires authentication | No | Advanced website indexing doesn't support crawling web pages that require authentication. |
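To see at a glance which of these errors you hit, you can summarize the failureSamples in the operations.get response by error message and then check each message against the table above. A minimal sketch, assuming the jq utility and a saved operation.json:

    # Count how many sampled failures carry each error message.
    jq -r '.response.failureSamples[]? | .failureReasons[] | .errorMessage' operation.json \
      | sort | uniq -c | sort -rn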

How deleted pages are handled

When a page is deleted, Google recommends that you manually refresh the deleted URLs.

When your website data store is crawled during either an automatic or a manual refresh, if a web page responds with a 4xx client error code or a 5xx server error code, the unresponsive web page is removed from the index.
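For example, after deleting pages from your site, you might recrawl their URLs so that the 4xx responses they now return cause the pages to be dropped from the index. The URLs below are placeholders:

    # Recrawl deleted pages; their 4xx responses remove them from the index.
    curl -X POST \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json" \
      -H "X-Goog-User-Project: PROJECT_ID" \
      "https://discoveryengine.googleapis.com/v1alpha/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/siteSearchEngine:recrawlUris" \
      -d '{"uris": ["https://example.com/deleted-page-1", "https://example.com/deleted-page-2"]}'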
