Refresh web pages using automatic and manual refresh
If your data store uses basic website search, the freshness of your store's index mirrors the freshness that's available in Google Search.
If advanced website indexing is enabled in your data store, the web pages in your data store are refreshed in the following ways:
- Automatic refresh
- Manual refresh
- Sitemap-based refresh
This page describes automatic and manual refresh. To understand and implement sitemap-based refresh, see Index and refresh according to sitemap.
Before you begin
If your website uses a robots.txt file, update it. For more information, see how to prepare your website's robots.txt file.
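As an illustration, the fragment below shows what an allow rule might look like. This is a hedged sketch: it assumes the Googlebot user agent is the one that needs access (the error table later on this page refers to Googlebot), and your site's actual rules may differ.

```
# Hypothetical robots.txt fragment: let the Googlebot user agent crawl the whole site
User-agent: Googlebot
Allow: /
```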
Automatic refresh
Vertex AI Search performs automatic refresh as follows:
- After you create a data store, it generates an initial index for the included pages.
- After the initial indexing, it indexes any newly discovered pages and recrawls existing pages on a best-effort basis.
- It regularly refreshes data stores that encounter a query rate of at least 50 queries per 30 days.
Manual refresh
If you want to refresh specific web pages in a data store with Advanced website indexing turned on, you can call the `recrawlUris` method. You use the `uris` field to specify each web page that you want to crawl. The `recrawlUris` method is a long-running operation that runs until your specified web pages are crawled or until it times out after 24 hours, whichever comes first. If the `recrawlUris` method times out, you can call the method again, specifying the web pages that remain to be crawled. You can poll the `operations.get` method to monitor the status of your recrawl operation.
The `recrawlUris` method recognizes literal URIs, not URI patterns. Any asterisk (*) in your URI is treated as a regular character. This is different from specifying the URLs to index when creating a website data store. When creating a data store, you can specify individual web pages or use wildcards to specify an entire website or part of a website, for example, www.mysite.com/*. By contrast, `recrawlUris` assumes www.mysite.com/* is a single page. For more information about creating a website data store, see Website URLs.

Limits on recrawling
There are limits to how often you can crawl web pages and how many web pages you can crawl at a time:

- Calls per day. The maximum number of calls to the `recrawlUris` method allowed is 20 per day, per project.
- Web pages per call. The maximum number of `uris` values that you can specify with a call to the `recrawlUris` method is 10,000.
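Taken together, these two limits cap a single project at 200,000 recrawled pages per day. A small sketch of the arithmetic, using a made-up workload of 25,000 pages:

```python
# Documented limits: 10,000 URIs per recrawlUris call, 20 calls per project per day.
MAX_URIS_PER_CALL = 10_000
MAX_CALLS_PER_DAY = 20

def calls_needed(total_uris: int) -> int:
    """Minimum number of recrawlUris calls to cover total_uris pages."""
    return -(-total_uris // MAX_URIS_PER_CALL)  # ceiling division

# Hypothetical workload of 25,000 pages:
print(calls_needed(25_000))                       # 3
print(calls_needed(25_000) <= MAX_CALLS_PER_DAY)  # True: fits within one day's quota
```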
Recrawl the web pages in your data store
You can manually crawl specific web pages in a data store that has Advanced website indexing turned on.
REST
To use the command line to crawl specific web pages in your data store, follow these steps:
Find your data store ID. If you already have your data store ID, skip to the next step.

In the Google Cloud console, go to the AI Applications page and, in the navigation menu, click Data Stores.

Click the name of your data store.

On the Data page for your data store, get the data store ID.
Call the `recrawlUris` method, using the `uris` field to specify each web page that you want to crawl. Each `uri` represents a single page even if it contains asterisks (*). Wildcard patterns are not supported.

```shell
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-H "X-Goog-User-Project: PROJECT_ID" \
"https://discoveryengine.googleapis.com/v1alpha/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/siteSearchEngine:recrawlUris" \
-d '{
  "uris": [URIS]
}'
```

Replace the following:
- PROJECT_ID: the ID of your Google Cloud project.
- DATA_STORE_ID: the ID of the Vertex AI Search data store.
- URIS: the list of web pages that you want to crawl, for example, "https://example.com/page-1", "https://example.com/page-2", "https://example.com/page-3".
The output is similar to the following:
```json
{
  "name": "projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/operations/recrawl-uris-0123456789012345678",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisMetadata"
  }
}
```

Save the `name` value as input for the `operations.get` operation when monitoring the status of your recrawl operation.
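If you script this step, the `name` value can be pulled out of the JSON response. A minimal sketch in Python, using a canned copy of the response (the project and data store names are placeholders; in practice you would parse the actual curl output):

```python
import json

# Canned recrawlUris response, shaped like the sample output above.
# "my-project" and "my-data-store" are hypothetical placeholder values.
response_text = """
{
  "name": "projects/my-project/locations/global/collections/default_collection/dataStores/my-data-store/operations/recrawl-uris-0123456789012345678",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisMetadata"
  }
}
"""

# Extract the operation name to pass to operations.get later.
operation_name = json.loads(response_text)["name"]
print(operation_name)
```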
Monitor the status of your recrawl operation
The `recrawlUris` method, which you use to crawl web pages in a data store, is a long-running operation that runs until your specified web pages are crawled or until it times out after 24 hours, whichever comes first. You can monitor the status of this long-running operation by polling the `operations.get` method, specifying the `name` value returned by the `recrawlUris` method. Continue polling until the response indicates that either: (1) all of your web pages are crawled, or (2) the operation timed out before all of your web pages were crawled. If `recrawlUris` times out, you can call it again, specifying the websites that were not crawled.
REST
To use the command line to monitor the status of a recrawl operation, follow these steps:
Find your data store ID. If you already have your data store ID, skip to the next step.

In the Google Cloud console, go to the AI Applications page and, in the navigation menu, click Data Stores.

Click the name of your data store.

On the Data page for your data store, get the data store ID.
Poll the `operations.get` method.

```shell
curl -X GET \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-H "X-Goog-User-Project: PROJECT_ID" \
"https://discoveryengine.googleapis.com/v1alpha/OPERATION_NAME"
```

Replace the following:
- PROJECT_ID: the ID of your Google Cloud project.
- OPERATION_NAME: the operation name, found in the `name` field returned in your call to the `recrawlUris` method in Recrawl the web pages in your data store. You can also get the operation name by listing long-running operations.
Evaluate each response.
If a response indicates that there are pending URIs and the recrawl operation is not done, your web pages are still being crawled. Continue polling.
Example
```json
{
  "name": "projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/operations/recrawl-uris-0123456789012345678",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisMetadata",
    "createTime": "2023-09-05T22:07:28.690950Z",
    "updateTime": "2023-09-05T22:22:10.978843Z",
    "validUrisCount": 4000,
    "successCount": 2215,
    "pendingCount": 1785
  },
  "done": false,
  "response": {
    "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisResponse"
  }
}
```
The response fields can be described as follows:
- `createTime`: the time that the long-running operation started.
- `updateTime`: the last time that the long-running operation metadata was updated. The metadata updates every five minutes until the operation is done.
- `validUrisCount`: indicates that you specified 4,000 valid URIs in your call to the `recrawlUris` method.
- `successCount`: indicates that 2,215 URIs were successfully crawled.
- `pendingCount`: indicates that 1,785 URIs have not yet been crawled.
- `done`: a value of `false` indicates that the recrawl operation is still in progress.
If a response indicates that there are no pending URIs (no `pendingCount` field is returned) and the recrawl operation is done, then your web pages are crawled. Stop polling; you can quit this procedure.

Example
```json
{
  "name": "projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/operations/recrawl-uris-0123456789012345678",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisMetadata",
    "createTime": "2023-09-05T22:07:28.690950Z",
    "updateTime": "2023-09-05T22:37:11.367998Z",
    "validUrisCount": 4000,
    "successCount": 4000
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisResponse"
  }
}
```
The response fields can be described as follows:
- `createTime`: the time that the long-running operation started.
- `updateTime`: the last time that the long-running operation metadata was updated. The metadata updates every five minutes until the operation is done.
- `validUrisCount`: indicates that you specified 4,000 valid URIs in your call to the `recrawlUris` method.
- `successCount`: indicates that 4,000 URIs were successfully crawled.
- `done`: a value of `true` indicates that the recrawl operation is done.
If a response indicates that there are pending URIs and the recrawl operation is done, then the recrawl operation timed out (after 24 hours) before all of your web pages were crawled. Start again at Recrawl the web pages in your data store. Use the `failedUris` values in the `operations.get` response for the values in the `uris` field in your new call to the `recrawlUris` method.

Example
```json
{
  "name": "projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/operations/recrawl-uris-8765432109876543210",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisMetadata",
    "createTime": "2023-09-05T22:07:28.690950Z",
    "updateTime": "2023-09-06T22:09:10.613751Z",
    "validUrisCount": 10000,
    "successCount": 9988,
    "pendingCount": 12
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.RecrawlUrisResponse",
    "failedUris": [
      "https://example.com/page-9989",
      "https://example.com/page-9990",
      "https://example.com/page-9991",
      "https://example.com/page-9992",
      "https://example.com/page-9993",
      "https://example.com/page-9994",
      "https://example.com/page-9995",
      "https://example.com/page-9996",
      "https://example.com/page-9997",
      "https://example.com/page-9998",
      "https://example.com/page-9999",
      "https://example.com/page-10000"
    ],
    "failureSamples": [
      {
        "uri": "https://example.com/page-9989",
        "failureReasons": [
          {"corpusType": "DESKTOP", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."},
          {"corpusType": "MOBILE", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."}
        ]
      },
      {
        "uri": "https://example.com/page-9990",
        "failureReasons": [
          {"corpusType": "DESKTOP", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."},
          {"corpusType": "MOBILE", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."}
        ]
      },
      {
        "uri": "https://example.com/page-9991",
        "failureReasons": [
          {"corpusType": "DESKTOP", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."},
          {"corpusType": "MOBILE", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."}
        ]
      },
      {
        "uri": "https://example.com/page-9992",
        "failureReasons": [
          {"corpusType": "DESKTOP", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."},
          {"corpusType": "MOBILE", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."}
        ]
      },
      {
        "uri": "https://example.com/page-9993",
        "failureReasons": [
          {"corpusType": "DESKTOP", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."},
          {"corpusType": "MOBILE", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."}
        ]
      },
      {
        "uri": "https://example.com/page-9994",
        "failureReasons": [
          {"corpusType": "DESKTOP", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."},
          {"corpusType": "MOBILE", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."}
        ]
      },
      {
        "uri": "https://example.com/page-9995",
        "failureReasons": [
          {"corpusType": "DESKTOP", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."},
          {"corpusType": "MOBILE", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."}
        ]
      },
      {
        "uri": "https://example.com/page-9996",
        "failureReasons": [
          {"corpusType": "DESKTOP", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."},
          {"corpusType": "MOBILE", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."}
        ]
      },
      {
        "uri": "https://example.com/page-9997",
        "failureReasons": [
          {"corpusType": "DESKTOP", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."},
          {"corpusType": "MOBILE", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."}
        ]
      },
      {
        "uri": "https://example.com/page-9998",
        "failureReasons": [
          {"corpusType": "DESKTOP", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."},
          {"corpusType": "MOBILE", "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."}
        ]
      }
    ]
  }
}
```
Here are some descriptions of response fields:
- `createTime`: the time that the long-running operation started.
- `updateTime`: the last time that the long-running operation metadata was updated. The metadata updates every five minutes until the operation is done.
- `validUrisCount`: indicates that you specified 10,000 valid URIs in your call to the `recrawlUris` method.
- `successCount`: indicates that 9,988 URIs were successfully crawled.
- `pendingCount`: indicates that 12 URIs have not yet been crawled.
- `done`: a value of `true` indicates that the recrawl operation is done.
- `failedUris`: a list of URIs that were not crawled before the recrawl operation timed out.
- `failureInfo`: information about URIs that failed to crawl. At most ten `failureInfo` array values are returned, even if more than ten URIs failed to crawl.
- `errorMessage`: the reason that a URI failed to crawl, by `corpusType`. For more information, see Error messages.
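The three outcomes above (still crawling, complete, timed out) can be sketched as a small helper that classifies an `operations.get` response and pulls out the URIs to retry. This is an illustrative sketch, not an official client; the sample response below is a trimmed, hypothetical version of the timed-out example:

```python
def recrawl_status(op: dict) -> tuple[str, list[str]]:
    """Classify an operations.get response and return (status, uris_to_retry)."""
    pending = op.get("metadata", {}).get("pendingCount", 0)
    done = op.get("done", False)
    if not done:
        return "in progress: keep polling", []
    if pending == 0:
        return "complete: stop polling", []
    # Timed out: retry with the failedUris values from the response.
    failed = op.get("response", {}).get("failedUris", [])
    return "timed out: recrawl failedUris", failed

# Hypothetical timed-out response, shaped like the third example above.
op = {
    "done": True,
    "metadata": {"validUrisCount": 10000, "successCount": 9988, "pendingCount": 12},
    "response": {"failedUris": ["https://example.com/page-9989",
                                "https://example.com/page-9990"]},
}
print(recrawl_status(op)[0])  # timed out: recrawl failedUris
```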
Timely refresh
Google recommends that you perform manual refresh on your new and updated pages to ensure that you have the latest index.
Error messages
When you are monitoring the status of your recrawl operation, if the recrawl operation times out while you are polling the `operations.get` method, `operations.get` returns error messages for web pages that were not crawled. The following table lists the error messages, whether the error is transient (a temporary error that resolves itself), and the actions that you can take before retrying the `recrawlUris` method. You can retry all transient errors immediately. Nontransient errors can be retried after you implement the remedy.
| Error message | Is it a transient error? | Action before retrying recrawl |
|---|---|---|
| Page was crawled but was not indexed by Vertex AI Search within 24 hours | Yes | Use the `failedUris` values in the `operations.get` response for the values in the `uris` field when you call the `recrawlUris` method. |
| Crawling was blocked by the site's `robots.txt` | No | Unblock the URI in your website's `robots.txt` file, ensure that the Googlebot user agent is permitted to crawl the website, and retry recrawl. For more information, see How to write and submit a robots.txt file. If you cannot access the `robots.txt` file, contact the domain owner. |
| Page is unreachable | No | Check the URI that you specified when you call the `recrawlUris` method. Ensure that you provide the literal URI and not a URI pattern. |
| Crawling timed out | Yes | Use the `failedUris` values in the `operations.get` response for the values in the `uris` field when you call the `recrawlUris` method. |
| Page was rejected by Google crawler | Yes | Use the `failedUris` values in the `operations.get` response for the values in the `uris` field when you call the `recrawlUris` method. |
| URL could not be followed by Google crawler | No | If there are multiple redirects, use the URI from the last redirect and retry. |
| Page was not found (404) | No | Check the URI that you specified when you call the `recrawlUris` method. Ensure that you provide the literal URI and not a URI pattern. Any page that responds with a `4xx` error code is removed from the index. |
| Page requires authentication | No | Advanced website indexing doesn't support crawling web pages that require authentication. |
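As a sketch of how the table might be applied programmatically, the helper below splits `failureSamples` entries into URIs you can retry right away and URIs that need a fix first. The substrings used to recognize transient messages are assumptions derived from the table above; match them against the `errorMessage` text your responses actually contain:

```python
# Message fragments that the table above marks as transient (retry immediately).
# These substrings are assumptions; verify them against your actual errorMessage text.
TRANSIENT_FRAGMENTS = ("not indexed", "timed out", "rejected by Google crawler")

def triage(failure_samples: list[dict]) -> tuple[list[str], list[str]]:
    """Return (retry_now, fix_first) URI lists from failureSamples entries."""
    retry_now, fix_first = [], []
    for sample in failure_samples:
        messages = [r.get("errorMessage", "") for r in sample.get("failureReasons", [])]
        if all(any(f in m for f in TRANSIENT_FRAGMENTS) for m in messages):
            retry_now.append(sample["uri"])
        else:
            fix_first.append(sample["uri"])
    return retry_now, fix_first

# Hypothetical failureSamples entries, shaped like the timed-out example above.
samples = [
    {"uri": "https://example.com/page-9989",
     "failureReasons": [{"corpusType": "DESKTOP",
                         "errorMessage": "Page was crawled but was not indexed by UCS within 24 hours."}]},
    {"uri": "https://example.com/blocked",
     "failureReasons": [{"corpusType": "DESKTOP",
                         "errorMessage": "Crawling was blocked by the site's robots.txt"}]},
]
print(triage(samples))  # (['https://example.com/page-9989'], ['https://example.com/blocked'])
```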
How deleted pages are handled
When a page is deleted, Google recommends that you manually refresh the deleted URLs.
When your website data store is crawled during either an automatic or a manual refresh, if a web page responds with a `4xx` client error code or `5xx` server error code, the unresponsive web page is removed from the index.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2026-02-19 UTC.