Prepare data for ingesting

How you prepare data depends on the kind of data you're importing and the way you choose to import it. Start with the kind of data you plan to import:

For information about blended search, where multiple data stores can be connected to a single custom search app, see About connecting multiple data stores.

Note: Use multiple smaller data stores and search apps instead of one large data store. As your data store size increases, search latency increases. Splitting your data into smaller, targeted data stores and connecting them to separate search apps helps maintain optimal performance and minimize latency.

Website data

When you create a data store for website data, you provide the URLs of webpages that Google should crawl and index for searching or recommending.

Before indexing your website data:

Unstructured data

Vertex AI Search supports search over documents that are in TXT, PDF, HTML, DOCX, PPTX, XLSX, and XLSM formats.

The maximum size for a file is 200 MB, and you can import up to 100,000 files at a time.

You import your documents from a Cloud Storage bucket. You can import using the Google Cloud console, the ImportDocuments method, or streaming ingestion through CRUD methods. For API reference information, see DocumentService and documents. If you plan to include embeddings in your unstructured data, see Use custom embeddings.
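
As a minimal sketch, the following shows a bulk import from Cloud Storage through the ImportDocuments method, assuming the google-cloud-discoveryengine Python client. The project and data store IDs are placeholders, and the input URI points at one of the public sample folders listed later in this section.

from google.cloud import discoveryengine_v1 as discoveryengine

# Placeholders: replace with your project, location, and data store IDs.
client = discoveryengine.DocumentServiceClient()
parent = client.branch_path(
    project="PROJECT_ID",
    location="global",
    data_store="DATA_STORE_ID",
    branch="default_branch",
)

request = discoveryengine.ImportDocumentsRequest(
    parent=parent,
    gcs_source=discoveryengine.GcsSource(
        # "content" tells the API that these are raw unstructured files.
        input_uris=["gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs/*.pdf"],
        data_schema="content",
    ),
    # INCREMENTAL adds or updates documents without deleting existing ones.
    reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL,
)

# ImportDocuments is a long-running operation; result() blocks until it finishes.
operation = client.import_documents(request=request)
response = operation.result()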

If you have non-searchable PDFs (scanned PDFs or PDFs with text inside images, such as infographics), we recommend turning on the layout parser during data store creation. This allows Vertex AI Search to extract elements such as text blocks and tables. If you have searchable PDFs that are mostly composed of machine-readable text and contain many tables, consider turning on OCR processing with the machine-readable text option enabled to improve detection and parsing. For more information, see Parse and chunk documents.

If you want to use Vertex AI Search for retrieval-augmented generation (RAG), turn on document chunking when you create your data store. For more information, see Parse and chunk documents.
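
As an illustrative sketch, the layout parser and layout-based chunking can be configured through the data store's document processing config when you create the data store with the Python client. The display name, IDs, and chunk size below are assumptions, not recommended values.

from google.cloud import discoveryengine_v1 as discoveryengine

# Turn on the layout parser and layout-based chunking (for RAG).
processing_config = discoveryengine.DocumentProcessingConfig(
    default_parsing_config=discoveryengine.DocumentProcessingConfig.ParsingConfig(
        layout_parsing_config=discoveryengine.DocumentProcessingConfig.ParsingConfig.LayoutParsingConfig()
    ),
    chunking_config=discoveryengine.DocumentProcessingConfig.ChunkingConfig(
        layout_based_chunking_config=discoveryengine.DocumentProcessingConfig.ChunkingConfig.LayoutBasedChunkingConfig(
            chunk_size=500,  # hypothetical token budget per chunk
            include_ancestor_headings=True,
        )
    ),
)

client = discoveryengine.DataStoreServiceClient()
operation = client.create_data_store(
    request=discoveryengine.CreateDataStoreRequest(
        parent=client.collection_path("PROJECT_ID", "global", "default_collection"),
        data_store=discoveryengine.DataStore(
            display_name="my-rag-data-store",  # hypothetical name
            industry_vertical=discoveryengine.IndustryVertical.GENERIC,
            solution_types=[discoveryengine.SolutionType.SOLUTION_TYPE_SEARCH],
            content_config=discoveryengine.DataStore.ContentConfig.CONTENT_REQUIRED,
            document_processing_config=processing_config,
        ),
        data_store_id="my-rag-data-store-id",
    )
)
operation.result()  # blocks until the data store is created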

You can import unstructured data from the following sources:

Cloud Storage

You can import data from Cloud Storage with or without metadata.

Data import is recursive. That is, if there are folders within the bucket or folder that you specify, files within those folders are imported.

If you plan to import documents from Cloud Storage without metadata, put your documents directly into a Cloud Storage bucket. The document ID is an example of metadata.

For testing, you can use the following publicly available Cloud Storage folders, which contain PDFs:

  • gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs
  • gs://cloud-samples-data/gen-app-builder/search/CUAD_v1
  • gs://cloud-samples-data/gen-app-builder/search/kaiser-health-surveys
  • gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224

If you plan to import data from Cloud Storage with metadata, put a JSON file that contains the metadata into a Cloud Storage bucket whose location you provide during import.

Your unstructured documents can be in the same Cloud Storage bucket as your metadata or a different one.

The metadata file must be a JSON Lines or NDJSON file. The document ID is an example of metadata. Each row of the metadata file must follow one of the following JSON formats:

  • Using jsonData:
    • { "id": "<your-id>", "jsonData": "<JSON string>", "content": { "mimeType": "<application/pdf or text/html>", "uri": "gs://<your-gcs-bucket>/directory/filename.pdf" } }
  • Using structData:
    • { "id": "<your-id>", "structData": { <JSON object> }, "content": { "mimeType": "<application/pdf or text/html>", "uri": "gs://<your-gcs-bucket>/directory/filename.pdf" } }

Use the uri field in each row to point to the Cloud Storage location of the document.

Here is an example of an NDJSON metadata file for unstructured documents. In this example, each line of the metadata file points to a PDF document and contains the metadata for that document. The first two lines use jsonData and the last two lines use structData. With structData, you don't need to escape nested quotation marks.

{"id":"doc-0","jsonData":"{\"title\":\"test_doc_0\",\"description\":\"This document uses a blue color theme\",\"color_theme\":\"blue\"}","content":{"mimeType":"application/pdf","uri":"gs://test-bucket-12345678/test_doc_0.pdf"}}{"id":"doc-1","jsonData":"{\"title\":\"test_doc_1\",\"description\":\"This document uses a green color theme\",\"color_theme\":\"green\"}","content":{"mimeType":"application/pdf","uri":"gs://test-bucket-12345678/test_doc_1.pdf"}}{"id":"doc-2","structData":{"title":"test_doc_2","description":"This document uses a red color theme","color_theme":"red"},"content":{"mimeType":"application/pdf","uri":"gs://test-bucket-12345678/test_doc_3.pdf"}}{"id":"doc-3","structData":{"title":"test_doc_3","description":"This is document uses a yellow color theme","color_theme":"yellow"},"content":{"mimeType":"application/pdf","uri":"gs://test-bucket-12345678/test_doc_4.pdf"}}

To create your data store, see Create a search data store.

BigQuery

If you plan to import metadata from BigQuery, create a BigQuery table that contains metadata. The document ID is an example of metadata.

Put your unstructured documents into a Cloud Storage bucket.

Use the following BigQuery schema. Use the uri field in each record to point to the Cloud Storage location of the document.

[
  {
    "name": "id",
    "mode": "REQUIRED",
    "type": "STRING",
    "fields": []
  },
  {
    "name": "jsonData",
    "type": "STRING",
    "fields": []
  },
  {
    "name": "content",
    "type": "RECORD",
    "mode": "NULLABLE",
    "fields": [
      {
        "name": "mimeType",
        "type": "STRING",
        "mode": "NULLABLE"
      },
      {
        "name": "uri",
        "type": "STRING",
        "mode": "NULLABLE"
      }
    ]
  }
]
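
As a minimal sketch, you could create a table with this schema using the google-cloud-bigquery Python client; the project, dataset, and table names below are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

# Mirrors the JSON schema above: id, jsonData, and a content RECORD.
schema = [
    bigquery.SchemaField("id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("jsonData", "STRING"),
    bigquery.SchemaField(
        "content",
        "RECORD",
        mode="NULLABLE",
        fields=[
            bigquery.SchemaField("mimeType", "STRING", mode="NULLABLE"),
            bigquery.SchemaField("uri", "STRING", mode="NULLABLE"),
        ],
    ),
]

# Placeholder table path: PROJECT_ID.DATASET_ID.TABLE_ID
table = bigquery.Table("PROJECT_ID.my_dataset.unstructured_metadata", schema=schema)
table = client.create_table(table)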

For more information, see Create and use tables in the BigQuery documentation.

To create your data store, see Create a search data store.

Google Drive

Syncing data from Google Drive is supported for custom search.

If you plan to import data from Google Drive, you must set up Google Identity as your identity provider in Vertex AI Search. For information about setting up access control, see Use data source access control.

To create your data store, see Create a search data store.

Structured data

Prepare your data according to the import method that you plan to use. If you plan to ingest media data, also see Structured media data.

You can import structured data from the following sources:

When you import structured data from BigQuery or from Cloud Storage, you are given the option to import the data with metadata. (Structured with metadata is also referred to as enhanced structured data.)

BigQuery

You can import structured data from BigQuery datasets.

Your schema is auto-detected. After importing, Google recommends that you edit the auto-detected schema to map key properties, such as titles. If you import using the API instead of the Google Cloud console, you have the option to provide your own schema as a JSON object. For more information, see Provide or auto-detect a schema.
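
For example, here is a sketch of providing a schema through the API, assuming the SchemaServiceClient and the keyPropertyMapping annotation described in Provide or auto-detect a schema; the property names and IDs are hypothetical.

import json

from google.cloud import discoveryengine_v1 as discoveryengine

client = discoveryengine.SchemaServiceClient()

# Hypothetical schema that marks "title" as the title key property.
json_schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
        "title": {
            "type": "string",
            "keyPropertyMapping": "title",
            "retrievable": True,
        },
    },
}

operation = client.update_schema(
    request=discoveryengine.UpdateSchemaRequest(
        schema=discoveryengine.Schema(
            name=client.schema_path(
                "PROJECT_ID", "global", "DATA_STORE_ID", "default_schema"
            ),
            json_schema=json.dumps(json_schema),
        )
    )
)
operation.result()  # schema updates are long-running operations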

For examples of publicly available structured data, see the BigQuery public datasets.

If you plan to include embeddings in your structured data, see Use custom embeddings.

If you choose to import structured data with metadata, you include two fields in your BigQuery tables:

  • An id field to identify the document. If you import structured data without metadata, then the id is generated for you. Including metadata lets you specify the value of id.

  • A jsonData field that contains the data. For examples of jsonData strings, see the preceding section Cloud Storage.

Use the following BigQuery schema for structured data with metadata imports:

[
  {
    "name": "id",
    "mode": "REQUIRED",
    "type": "STRING",
    "fields": []
  },
  {
    "name": "jsonData",
    "mode": "NULLABLE",
    "type": "STRING",
    "fields": []
  }
]
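
A brief sketch of loading such rows with the google-cloud-bigquery client follows; the table path and the hotel document are placeholders that echo the NDJSON example later in this section.

import json

from google.cloud import bigquery

client = bigquery.Client()

rows = [
    {
        "id": "10001",
        # jsonData is a STRING column, so the document itself is
        # serialized with json.dumps before it is inserted.
        "jsonData": json.dumps(
            {"title": "Hotel 1", "rating": 3.7, "non_smoking": True}
        ),
    },
]

# Placeholder table path: PROJECT_ID.DATASET_ID.TABLE_ID
errors = client.insert_rows_json("PROJECT_ID.my_dataset.structured_docs", rows)
assert not errors, errors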

For instructions on creating your data store, see Create a search data store or Create a recommendations data store.

Notes:

Cloud Storage

Structured data in Cloud Storage must be in either JSON Lines or NDJSON format. Each file must be less than 2 GB, and each row of the file must be less than 1 MB. You can import up to 1,000 files in a single import request.

For examples of publicly available structured data, refer to the following folders in Cloud Storage, which contain NDJSON files:

  • gs://cloud-samples-data/gen-app-builder/search/kaggle_movies
  • gs://cloud-samples-data/gen-app-builder/search/austin_311

If you plan to include embeddings in your structured data, see Use custom embeddings.

Here is an example of an NDJSON file of structured data. Each line of the file represents a document and is made up of a set of fields.

{"id": 10001, "title": "Hotel 1", "location": {"address": "1600 Amphitheatre Parkway, Mountain View, CA 94043"}, "available_date": "2024-02-10", "non_smoking": true, "rating": 3.7, "room_types": ["Deluxe", "Single", "Suite"]}{"id": 10002, "title": "Hotel 2", "location": {"address": "Manhattan, New York, NY 10001"}, "available_date": "2023-07-10", "non_smoking": false, "rating": 5.0, "room_types": ["Deluxe", "Double", "Suite"]}{"id": 10003, "title": "Hotel 3", "location": {"address": "Moffett Park, Sunnyvale, CA 94089"}, "available_date": "2023-06-24", "non_smoking": true, "rating": 2.5, "room_types": ["Double", "Penthouse", "Suite"]}

To create your data store, see Create a search data store or Create a recommendations data store.

Local JSON data

You can directly upload a JSON document or object using the API.
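
For instance, here is a minimal sketch of creating a single document inline with the google-cloud-discoveryengine Python client; the IDs and the document body are placeholders.

from google.cloud import discoveryengine_v1 as discoveryengine

client = discoveryengine.DocumentServiceClient()

document = discoveryengine.Document(
    # struct_data takes the JSON object directly; alternatively,
    # json_data accepts the document as a JSON string.
    struct_data={"title": "Hotel 1", "rating": 3.7},
)

created = client.create_document(
    request=discoveryengine.CreateDocumentRequest(
        parent=client.branch_path(
            "PROJECT_ID", "global", "DATA_STORE_ID", "default_branch"
        ),
        document=document,
        document_id="10001",  # placeholder document ID
    )
)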

Google recommends providing your own schema as a JSON object for better results. If you don't provide your own schema, the schema is auto-detected. After importing, we recommend that you edit the auto-detected schema to map key properties, such as titles. For more information, see Provide or auto-detect a schema.

If you plan to include embeddings in your structured data, see Use custom embeddings.

To create your data store, see Create a search data store or Create a recommendations data store.

Structured media data

If you plan to ingest structured media data, such as videos, news, or music, review the following:

Healthcare FHIR data

If you plan to ingest FHIR data from Cloud Healthcare API, ensure the following:

  • Location: The source FHIR store must be in a Cloud Healthcare API dataset that's in the us-central1, us, or eu location. For more information, see Create and manage datasets in Cloud Healthcare API.
  • FHIR store type: The source FHIR store must be an R4 store. You can check the versions of your FHIR stores by listing the FHIR stores in your dataset. To create a FHIR R4 store, see Create FHIR stores.
  • Import quota: The source FHIR store must have fewer than 1 million FHIR resources. If there are more than 1 million FHIR resources, the import process stops after this limit is reached. For more information, see Quotas and limits.
  • Supported resources: Review the list of FHIR R4 resources that Vertex AI Search supports. For more information, see Healthcare FHIR R4 data schema reference.
  • Resource references: Ensure that relative resource references are in the format Resource/resourceId. For example, subject.reference must have a value such as Patient/034AB16. For more information on how Cloud Healthcare API supports FHIR resource references, see FHIR resource references.

  • The files referenced in a DocumentReference resource must be PDF, RTF, or image files that are stored in Cloud Storage. The link to the referenced files must be in the content[].attachment.url field of the resource in the standard Cloud Storage path format: gs://BUCKET_NAME/PATH_TO_REFERENCED_FILE.

    The following table lists the file size limits for each file type with different configurations (for more information, see Parse and chunk documents). You can import up to 100,000 files at a time.
