Prepare data for ingesting
How you prepare data depends on the kind of data you're importing and the way you choose to import it. Start with what kind of data you plan to import:

- Website data
- Unstructured data
- Structured data
- Structured media data
- Healthcare FHIR data
For information about blended search, where multiple data stores can be connected to a single custom search app, see About connecting multiple data stores.
Note: Use multiple smaller data stores and search apps instead of one large data store. As your data store size increases, the search latency increases. Splitting your data into smaller, targeted data stores and connecting them to separate search apps helps maintain optimal performance and minimizes latency.

Website data
When you create a data store for website data, you provide the URLs of webpages that Google should crawl and index for searching or recommending.
Before indexing your website data:
Decide which URL patterns to include in your indexing and which to exclude.
- Exclude the patterns for dynamic URLs. Dynamic URLs are URLs that change at the time of serving, depending on the request. For example, consider the URL patterns for the web pages that serve search results, such as www.example.com/search/*. Suppose a user searches for the phrase Nobel prize; the dynamic search URL might be a unique URL such as www.example.com/search?q=nobel%20prize/UNIQUE_STRING. If the URL pattern www.example.com/search/* is not excluded, then all such unique, dynamic search URLs that follow this pattern are indexed. This results in a bloated index and diluted search quality.

- Eliminate duplicate URLs using canonical URL patterns. This provides a single canonical URL for Google Search when crawling the website and removes ambiguity. For examples of canonicalization and more information, see What is URL canonicalization and How to specify a canonical URL with rel="canonical" and other methods.
You can include URL patterns either from the same or different domains that need to be indexed and exclude patterns that must not be indexed. The number of URL patterns that you can include and exclude differs in the following way:
| Indexing type | Included sites | Excluded sites |
| --- | --- | --- |
| Basic website search | Maximum of 50 URL patterns | Maximum of 50 URL patterns |
| Advanced website indexing | Maximum of 500 URL patterns | Maximum of 500 URL patterns |

If you use a robots.txt file on your website, do the following:

- Make sure that Google-CloudVertexBot can access your content. The Vertex AI Search bot needs to crawl and index your information, including any paywalled content. For example:

  ```
  User-agent: Google-CloudVertexBot
  Allow: /
  ```

- Check that the web pages you plan to add to your data store don't block indexing.

For more information, see Introduction to robots.txt and How to write and submit a robots.txt file.
If you plan to use Advanced website indexing, you must be able to verify the domains for the URL patterns in your data store.
Add structured data in the form of meta tags and PageMaps to your data store schema to enrich your indexing, as explained in Use structured data for advanced website indexing.
Unstructured data
Vertex AI Search supports search over documents that are in TXT, PDF, HTML, DOCX, PPTX, XLSX, and XLSM formats.

The maximum size for a file is 200 MB, and you can import up to 100,000 files at a time.
You import your documents from a Cloud Storage bucket. You can import using the Google Cloud console, the ImportDocuments method, or streaming ingestion through CRUD methods. For API reference information, see DocumentService and documents. If you plan to include embeddings in your unstructured data, see Use custom embeddings.
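For illustration, the following is a minimal sketch of calling the ImportDocuments method with the Python client library (google-cloud-discoveryengine). The project ID, data store ID, and bucket path are placeholders, and the API reference remains the authoritative source for the request surface:

```python
# Hedged sketch: import unstructured documents (no metadata) from
# Cloud Storage with the ImportDocuments method. Assumes the
# google-cloud-discoveryengine client library is installed and that the
# data store already exists. All IDs and paths are placeholders.
from google.cloud import discoveryengine_v1 as discoveryengine

client = discoveryengine.DocumentServiceClient()
parent = client.branch_path(
    project="your-project-id",     # placeholder
    location="global",
    data_store="your-data-store",  # placeholder
    branch="default_branch",
)

operation = client.import_documents(
    request=discoveryengine.ImportDocumentsRequest(
        parent=parent,
        gcs_source=discoveryengine.GcsSource(
            # "content" tells the importer that the URIs point at the
            # document files themselves rather than at metadata records.
            input_uris=["gs://your-bucket/documents/*.pdf"],  # placeholder
            data_schema="content",
        ),
        # INCREMENTAL adds or updates documents without deleting others.
        reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL,
    )
)
print(operation.result())  # blocks until the long-running import finishes
```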
If you have non-searchable PDFs (scanned PDFs or PDFs with text inside images, such as infographics), we recommend turning on the layout parser during data store creation. This allows Vertex AI Search to extract elements such as text blocks and tables. If you have searchable PDFs that are mostly composed of machine-readable text and contain many tables, you can consider turning on OCR processing with the option for machine-readable text enabled in order to improve detection and parsing. For more information, see Parse and chunk documents.
If you want to use Vertex AI Search for retrieval-augmented generation (RAG), turn on document chunking when you create your data store. For more information, see Parse and chunk documents.
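As an illustration only, the following sketch shows roughly how a data store could be created with the layout parser and layout-based chunking turned on through the Python client library. The field names follow the google-cloud-discoveryengine library, but treat the exact configuration as an assumption and see Parse and chunk documents for the authoritative options:

```python
# Hedged sketch: create a data store with the layout parser and
# document chunking enabled (useful for RAG). All IDs are placeholders.
from google.cloud import discoveryengine_v1 as discoveryengine

client = discoveryengine.DataStoreServiceClient()
parent = client.collection_path(
    project="your-project-id",  # placeholder
    location="global",
    collection="default_collection",
)

data_store = discoveryengine.DataStore(
    display_name="my-unstructured-store",
    industry_vertical=discoveryengine.IndustryVertical.GENERIC,
    solution_types=[discoveryengine.SolutionType.SOLUTION_TYPE_SEARCH],
    content_config=discoveryengine.DataStore.ContentConfig.CONTENT_REQUIRED,
    document_processing_config=discoveryengine.DocumentProcessingConfig(
        # The layout parser extracts elements such as text blocks and tables.
        default_parsing_config=discoveryengine.DocumentProcessingConfig.ParsingConfig(
            layout_parsing_config=discoveryengine.DocumentProcessingConfig.ParsingConfig.LayoutParsingConfig()
        ),
        # Layout-based chunking splits documents into retrieval units.
        chunking_config=discoveryengine.DocumentProcessingConfig.ChunkingConfig(
            layout_based_chunking_config=discoveryengine.DocumentProcessingConfig.ChunkingConfig.LayoutBasedChunkingConfig(
                chunk_size=500,                 # tokens per chunk (assumed value)
                include_ancestor_headings=True, # keep heading context in chunks
            )
        ),
    ),
)

operation = client.create_data_store(
    request=discoveryengine.CreateDataStoreRequest(
        parent=parent,
        data_store=data_store,
        data_store_id="my-unstructured-store-id",  # placeholder
    )
)
print(operation.result())
```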
You can import unstructured data from the following sources:
Cloud Storage
You can import data from Cloud Storage with or without metadata.
Data import is recursive. That is, if there are folders within the bucket or folder that you specify, files within those folders are imported.

If you plan to import documents from Cloud Storage without metadata, put your documents directly into a Cloud Storage bucket. The document ID is an example of metadata.
For testing, you can use the following publicly available Cloud Storage folders, which contain PDFs:

- gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs
- gs://cloud-samples-data/gen-app-builder/search/CUAD_v1
- gs://cloud-samples-data/gen-app-builder/search/kaiser-health-surveys
- gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224
If you plan to import data from Cloud Storage with metadata, put a JSON file that contains the metadata into a Cloud Storage bucket whose location you provide during import.

Your unstructured documents can be in the same Cloud Storage bucket as your metadata or a different one.
The metadata file must be a JSON Lines or NDJSON file. The document ID is an example of metadata. Each row of the metadata file must follow one of the following JSON formats:
- Using jsonData:

  ```
  {
    "id": "<your-id>",
    "jsonData": "<JSON string>",
    "content": {
      "mimeType": "<application/pdf or text/html>",
      "uri": "gs://<your-gcs-bucket>/directory/filename.pdf"
    }
  }
  ```
- Using structData:

  ```
  {
    "id": "<your-id>",
    "structData": { <JSON object> },
    "content": {
      "mimeType": "<application/pdf or text/html>",
      "uri": "gs://<your-gcs-bucket>/directory/filename.pdf"
    }
  }
  ```
Use the uri field in each row to point to the Cloud Storage location of the document.
Here is an example of an NDJSON metadata file for an unstructured document. In this example, each line of the metadata file points to a PDF document and contains the metadata for that document. The first two lines use jsonData and the second two lines use structData. With structData you don't need to escape quotation marks that appear within quotation marks.
{"id":"doc-0","jsonData":"{\"title\":\"test_doc_0\",\"description\":\"This document uses a blue color theme\",\"color_theme\":\"blue\"}","content":{"mimeType":"application/pdf","uri":"gs://test-bucket-12345678/test_doc_0.pdf"}}{"id":"doc-1","jsonData":"{\"title\":\"test_doc_1\",\"description\":\"This document uses a green color theme\",\"color_theme\":\"green\"}","content":{"mimeType":"application/pdf","uri":"gs://test-bucket-12345678/test_doc_1.pdf"}}{"id":"doc-2","structData":{"title":"test_doc_2","description":"This document uses a red color theme","color_theme":"red"},"content":{"mimeType":"application/pdf","uri":"gs://test-bucket-12345678/test_doc_3.pdf"}}{"id":"doc-3","structData":{"title":"test_doc_3","description":"This is document uses a yellow color theme","color_theme":"yellow"},"content":{"mimeType":"application/pdf","uri":"gs://test-bucket-12345678/test_doc_4.pdf"}}To create your data store, seeCreate a search data store.
BigQuery
If you plan to import metadata from BigQuery, create a BigQuery table that contains metadata. The document ID is an example of metadata.
Put your unstructured documents into a Cloud Storage bucket.
Use the following BigQuery schema. Use the uri field in each record to point to the Cloud Storage location of the document.

```
[
  {
    "name": "id",
    "mode": "REQUIRED",
    "type": "STRING",
    "fields": []
  },
  {
    "name": "jsonData",
    "type": "STRING",
    "fields": []
  },
  {
    "name": "content",
    "type": "RECORD",
    "mode": "NULLABLE",
    "fields": [
      {
        "name": "mimeType",
        "type": "STRING",
        "mode": "NULLABLE"
      },
      {
        "name": "uri",
        "type": "STRING",
        "mode": "NULLABLE"
      }
    ]
  }
]
```

For more information, see Create and use tables in the BigQuery documentation.
To create your data store, see Create a search data store.
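As with the Cloud Storage examples, an API import from BigQuery can be sketched as follows. The field names follow the google-cloud-discoveryengine library; treating the table's id/jsonData/content layout as the "document" data schema is an assumption, and all IDs are placeholders:

```python
# Hedged sketch: import unstructured-document metadata from a BigQuery
# table that follows the schema shown above.
from google.cloud import discoveryengine_v1 as discoveryengine

client = discoveryengine.DocumentServiceClient()
operation = client.import_documents(
    request=discoveryengine.ImportDocumentsRequest(
        parent=client.branch_path(
            project="your-project-id",     # placeholder
            location="global",
            data_store="your-data-store",  # placeholder
            branch="default_branch",
        ),
        bigquery_source=discoveryengine.BigQuerySource(
            project_id="your-project-id",    # placeholder
            dataset_id="your_dataset",       # placeholder
            table_id="your_metadata_table",  # placeholder
            data_schema="document",          # table uses id/jsonData/content
        ),
    )
)
print(operation.result())
```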
Google Drive
Syncing data from Google Drive is supported for custom search.
If you plan to import data from Google Drive, you must set up Google Identity as your identity provider in Vertex AI Search. For information about setting up access control, see Use data source access control.
To create your data store, see Create a search data store.
Structured data
Prepare your data according to the import method that you plan to use. If you plan to ingest media data, also see Structured media data.
You can import structured data from the following sources:
When you import structured data from BigQuery or from Cloud Storage, you are given the option to import the data with metadata. (Structured with metadata is also referred to as enhanced structured data.)
BigQuery
You can import structured data from BigQuery datasets.
Your schema is auto-detected. After importing, Google recommends that you edit the auto-detected schema to map key properties, such as titles. If you import using the API instead of the Google Cloud console, you have the option to provide your own schema as a JSON object. For more information, see Provide or auto-detect a schema.
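As a rough illustration of providing your own schema through the API, the following sketch updates a data store's schema with the Python client library. The keyPropertyMapping annotation and the exact JSON Schema shape are assumptions here; see Provide or auto-detect a schema for the authoritative format:

```python
# Hedged sketch: provide a custom schema that maps a "title" field to the
# title key property. All IDs are placeholders.
import json

from google.cloud import discoveryengine_v1 as discoveryengine

client = discoveryengine.SchemaServiceClient()
schema_name = client.schema_path(
    project="your-project-id",     # placeholder
    location="global",
    data_store="your-data-store",  # placeholder
    schema="default_schema",
)

operation = client.update_schema(
    request=discoveryengine.UpdateSchemaRequest(
        schema=discoveryengine.Schema(
            name=schema_name,
            json_schema=json.dumps(
                {
                    "$schema": "https://json-schema.org/draft/2020-12/schema",
                    "type": "object",
                    "properties": {
                        "title": {
                            "type": "string",
                            # Assumed annotation mapping this field to the
                            # data store's title key property.
                            "keyPropertyMapping": "title",
                        }
                    },
                }
            ),
        )
    )
)
print(operation.result())
```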
For examples of publicly available structured data, see the BigQuery public datasets.

If you plan to include embeddings in your structured data, see Use custom embeddings.
If you choose to import structured data with metadata, you include two fields in your BigQuery tables:
- An id field to identify the document. If you import structured data without metadata, then the id is generated for you. Including metadata lets you specify the value of id.

- A jsonData field that contains the data. For examples of jsonData strings, see the preceding section Cloud Storage.
Use the following BigQuery schema for structured data with metadata imports:

```
[
  {
    "name": "id",
    "mode": "REQUIRED",
    "type": "STRING",
    "fields": []
  },
  {
    "name": "jsonData",
    "mode": "NULLABLE",
    "type": "STRING",
    "fields": []
  }
]
```

For instructions on creating your data store, see Create a search data store or Create a recommendations data store.
Notes:

- You can't import from BigQuery tables that have external data sources. For information about external data sources in BigQuery, see Introduction to external tables in the BigQuery documentation.
- If your BigQuery tables contain columns with flexible column names, those columns aren't imported. For information about flexible column names, see Flexible column names in the BigQuery documentation.
Cloud Storage
Structured data in Cloud Storage must be in either JSON Lines or NDJSON format. Each file must be less than 2 GB, and each row of the file must be less than 1 MB. You can import up to 1,000 files in a single import request.
For examples of publicly available structured data, refer to the following folders in Cloud Storage, which contain NDJSON files:

- gs://cloud-samples-data/gen-app-builder/search/kaggle_movies
- gs://cloud-samples-data/gen-app-builder/search/austin_311
If you plan to include embeddings in your structured data, see Use custom embeddings.

Here is an example of an NDJSON metadata file of structured data. Each line of the file represents a document and is made up of a set of fields.
{"id": 10001, "title": "Hotel 1", "location": {"address": "1600 Amphitheatre Parkway, Mountain View, CA 94043"}, "available_date": "2024-02-10", "non_smoking": true, "rating": 3.7, "room_types": ["Deluxe", "Single", "Suite"]}{"id": 10002, "title": "Hotel 2", "location": {"address": "Manhattan, New York, NY 10001"}, "available_date": "2023-07-10", "non_smoking": false, "rating": 5.0, "room_types": ["Deluxe", "Double", "Suite"]}{"id": 10003, "title": "Hotel 3", "location": {"address": "Moffett Park, Sunnyvale, CA 94089"}, "available_date": "2023-06-24", "non_smoking": true, "rating": 2.5, "room_types": ["Double", "Penthouse", "Suite"]}To create your data store, seeCreate a search data store orCreate a recommendations data store.
Local JSON data
You can directly upload a JSON document or object using the API.
Google recommends providing your own schema as a JSON object for better results. If you don't provide your own schema, the schema is auto-detected. After importing, we recommend that you edit the auto-detected schema to map key properties, such as titles. For more information, see Provide or auto-detect a schema.
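For a single document, a minimal sketch of a direct upload with the Python client library might look like the following; the document fields and all IDs are placeholders:

```python
# Hedged sketch: create one document directly through the API, with the
# structured record supplied inline as a JSON string.
from google.cloud import discoveryengine_v1 as discoveryengine

client = discoveryengine.DocumentServiceClient()
document = client.create_document(
    request=discoveryengine.CreateDocumentRequest(
        parent=client.branch_path(
            project="your-project-id",     # placeholder
            location="global",
            data_store="your-data-store",  # placeholder
            branch="default_branch",
        ),
        document=discoveryengine.Document(
            # json_data holds the structured record as a JSON string.
            json_data='{"title": "Hotel 1", "rating": 3.7}',  # placeholder
        ),
        document_id="doc-10001",  # becomes the document's resource ID
    )
)
print(document.name)
```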
If you plan to include embeddings in your structured data, see Use custom embeddings.

To create your data store, see Create a search data store or Create a recommendations data store.
Structured media data
If you plan to ingest structured media data, such as videos, news, or music, review the following:

- Information about your import method (BigQuery or Cloud Storage): Structured data
- Required schemas and fields for media documents and data stores: About media documents and data stores
- User event requirements and schemas: About media user events
- Information about media recommendations types: About media recommendations types
Healthcare FHIR data
If you plan to ingest FHIR data from Cloud Healthcare API, ensure the following:
- Location: The source FHIR store must be in a Cloud Healthcare API dataset that's in the us-central1, us, or eu location. For more information, see Create and manage datasets in Cloud Healthcare API.
- FHIR store type: The source FHIR store must be an R4 data store. You can check the versions of your FHIR stores by listing the FHIR stores in your dataset, as sketched after this list. To create a FHIR R4 store, see Create FHIR stores.
- Import quota: The source FHIR store must have fewer than 1 million FHIR resources. If there are more than 1 million FHIR resources, the import process stops after this limit is reached. For more information, see Quotas and limits.
- Supported resources: Review the list of FHIR R4 resources that Vertex AI Search supports. For more information, see Healthcare FHIR R4 data schema reference.
- Resource references: Ensure that relative resource references are in the format Resource/resourceId. For example, subject.reference must have its value as Patient/034AB16. For more information on how Cloud Healthcare API supports FHIR resource references, see FHIR resource references.
- Referenced files: The files referenced in a DocumentReference resource must be PDF, RTF, or image files that are stored in Cloud Storage. The link to the referenced files must be in the content[].attachment.url field of the resource in the standard Cloud Storage path format: gs://BUCKET_NAME/PATH_TO_REFERENCED_FILE. You can import up to 100,000 files at a time. For the file size limits of each file type with different configurations, see Parse and chunk documents.