Prepare data for ingesting

How you prepare data depends on the kind of data you're importing and the way you choose to import it. Start with the kind of data you plan to import:

For information about blended search, where multiple data stores can be connected to a single custom search app, see About connecting multiple data stores.

Note: Use multiple smaller data stores and search apps instead of one large data store. As your data store size increases, search latency increases. Splitting your data into smaller, targeted data stores and connecting them to separate search apps helps maintain optimal performance and minimize latency.

Website data

When you create a data store for website data, you provide the URLs of webpages that Google should crawl and index for searching or recommending.

Before indexing your website data:

Unstructured data

Vertex AI Search supports search over documents that are in TXT, PDF, HTML, DOCX, PPTX, XLSX, and XLSM formats.

The maximum size for a file is 200 MB, and you can import up to 100,000 files at a time.

You import your documents from a Cloud Storage bucket. You can import using the Google Cloud console, the ImportDocuments method, or streaming ingestion through CRUD methods. For API reference information, see DocumentService and documents. If you plan to include embeddings in your unstructured data, see Use custom embeddings.
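
As a minimal sketch, the following shows a bulk import from Cloud Storage through the ImportDocuments method, assuming the google-cloud-discoveryengine Python client. The project and data store IDs are placeholders, and the input URI points at one of the public sample folders listed later in this section.

from google.cloud import discoveryengine_v1 as discoveryengine

# Placeholders: replace with your project, location, and data store IDs.
client = discoveryengine.DocumentServiceClient()
parent = client.branch_path(
    project="PROJECT_ID",
    location="global",
    data_store="DATA_STORE_ID",
    branch="default_branch",
)

request = discoveryengine.ImportDocumentsRequest(
    parent=parent,
    gcs_source=discoveryengine.GcsSource(
        # "content" tells the API that these are raw unstructured files.
        input_uris=["gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs/*.pdf"],
        data_schema="content",
    ),
    # INCREMENTAL adds or updates documents without deleting existing ones.
    reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL,
)

# ImportDocuments is a long-running operation; result() blocks until it finishes.
operation = client.import_documents(request=request)
response = operation.result()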

If you have non-searchable PDFs (scanned PDFs or PDFs with text inside images, such as infographics), we recommend turning on the layout parser during data store creation. This allows Vertex AI Search to extract elements such as text blocks and tables. If you have searchable PDFs that are mostly composed of machine-readable text and contain many tables, consider turning on OCR processing with the machine-readable text option enabled to improve detection and parsing. For more information, see Parse and chunk documents.

If you want to use Vertex AI Search for retrieval-augmented generation (RAG), turn on document chunking when you create your data store. For more information, see Parse and chunk documents.
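
As an illustrative sketch, the layout parser and layout-based chunking can be configured through the data store's document processing config when you create the data store with the Python client. The display name, IDs, and chunk size below are assumptions, not recommended values.

from google.cloud import discoveryengine_v1 as discoveryengine

# Turn on the layout parser and layout-based chunking (for RAG).
processing_config = discoveryengine.DocumentProcessingConfig(
    default_parsing_config=discoveryengine.DocumentProcessingConfig.ParsingConfig(
        layout_parsing_config=discoveryengine.DocumentProcessingConfig.ParsingConfig.LayoutParsingConfig()
    ),
    chunking_config=discoveryengine.DocumentProcessingConfig.ChunkingConfig(
        layout_based_chunking_config=discoveryengine.DocumentProcessingConfig.ChunkingConfig.LayoutBasedChunkingConfig(
            chunk_size=500,  # hypothetical token budget per chunk
            include_ancestor_headings=True,
        )
    ),
)

client = discoveryengine.DataStoreServiceClient()
operation = client.create_data_store(
    request=discoveryengine.CreateDataStoreRequest(
        parent=client.collection_path("PROJECT_ID", "global", "default_collection"),
        data_store=discoveryengine.DataStore(
            display_name="my-rag-data-store",  # hypothetical name
            industry_vertical=discoveryengine.IndustryVertical.GENERIC,
            solution_types=[discoveryengine.SolutionType.SOLUTION_TYPE_SEARCH],
            content_config=discoveryengine.DataStore.ContentConfig.CONTENT_REQUIRED,
            document_processing_config=processing_config,
        ),
        data_store_id="my-rag-data-store-id",
    )
)
operation.result()  # blocks until the data store is created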

You can import unstructured data from the following sources:

Cloud Storage

You can import data from Cloud Storage with or without metadata.

Data import is recursive. That is, if there are folders within the bucket or folder that you specify, files within those folders are imported.

If you plan to import documents from Cloud Storage without metadata, put your documents directly into a Cloud Storage bucket. The document ID is an example of metadata.

For testing, you can use the following publicly available Cloud Storage folders, which contain PDFs:

  • gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs
  • gs://cloud-samples-data/gen-app-builder/search/CUAD_v1
  • gs://cloud-samples-data/gen-app-builder/search/kaiser-health-surveys
  • gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224

If you plan to import data from Cloud Storage with metadata, put a JSON file that contains the metadata into a Cloud Storage bucket whose location you provide during import.

Your unstructured documents can be in the same Cloud Storage bucket as your metadata or a different one.

The metadata file must be a JSON Lines or NDJSON file. The document ID is an example of metadata. Each row of the metadata file must follow one of the following JSON formats:

  • Using jsonData:
    • { "id": "<your-id>", "jsonData": "<JSON string>", "content": { "mimeType": "<application/pdf or text/html>", "uri": "gs://<your-gcs-bucket>/directory/filename.pdf" } }
  • Using structData:
    • { "id": "<your-id>", "structData": { <JSON object> }, "content": { "mimeType": "<application/pdf or text/html>", "uri": "gs://<your-gcs-bucket>/directory/filename.pdf" } }

Use the uri field in each row to point to the Cloud Storage location of the document.

Here is an example of an NDJSON metadata file for unstructured documents. In this example, each line of the metadata file points to a PDF document and contains the metadata for that document. The first two lines use jsonData and the last two lines use structData. With structData, you don't need to escape nested quotation marks.

{"id":"doc-0","jsonData":"{\"title\":\"test_doc_0\",\"description\":\"This document uses a blue color theme\",\"color_theme\":\"blue\"}","content":{"mimeType":"application/pdf","uri":"gs://test-bucket-12345678/test_doc_0.pdf"}}{"id":"doc-1","jsonData":"{\"title\":\"test_doc_1\",\"description\":\"This document uses a green color theme\",\"color_theme\":\"green\"}","content":{"mimeType":"application/pdf","uri":"gs://test-bucket-12345678/test_doc_1.pdf"}}{"id":"doc-2","structData":{"title":"test_doc_2","description":"This document uses a red color theme","color_theme":"red"},"content":{"mimeType":"application/pdf","uri":"gs://test-bucket-12345678/test_doc_3.pdf"}}{"id":"doc-3","structData":{"title":"test_doc_3","description":"This is document uses a yellow color theme","color_theme":"yellow"},"content":{"mimeType":"application/pdf","uri":"gs://test-bucket-12345678/test_doc_4.pdf"}}

To create your data store, see Create a search data store.

BigQuery

If you plan to import metadata from BigQuery, create a BigQuery table that contains metadata. The document ID is an example of metadata.

Put your unstructured documents into a Cloud Storage bucket.

Use the following BigQuery schema. Use the uri field in each record to point to the Cloud Storage location of the document.

[
  {
    "name": "id",
    "mode": "REQUIRED",
    "type": "STRING",
    "fields": []
  },
  {
    "name": "jsonData",
    "type": "STRING",
    "fields": []
  },
  {
    "name": "content",
    "type": "RECORD",
    "mode": "NULLABLE",
    "fields": [
      {
        "name": "mimeType",
        "type": "STRING",
        "mode": "NULLABLE"
      },
      {
        "name": "uri",
        "type": "STRING",
        "mode": "NULLABLE"
      }
    ]
  }
]
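
As a minimal sketch, you could create a table with this schema using the google-cloud-bigquery Python client; the project, dataset, and table names below are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

# Mirrors the JSON schema above: id, jsonData, and a content RECORD.
schema = [
    bigquery.SchemaField("id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("jsonData", "STRING"),
    bigquery.SchemaField(
        "content",
        "RECORD",
        mode="NULLABLE",
        fields=[
            bigquery.SchemaField("mimeType", "STRING", mode="NULLABLE"),
            bigquery.SchemaField("uri", "STRING", mode="NULLABLE"),
        ],
    ),
]

# Placeholder table path: PROJECT_ID.DATASET_ID.TABLE_ID
table = bigquery.Table("PROJECT_ID.my_dataset.unstructured_metadata", schema=schema)
table = client.create_table(table)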

For more information, see Create and use tables in the BigQuery documentation.

To create your data store, see Create a search data store.

Google Drive

Syncing data from Google Drive is supported for custom search.

If you plan to import data from Google Drive, you must set up Google Identity as your identity provider in Vertex AI Search. For information about setting up access control, see Use data source access control.

To create your data store, see Create a search data store.

Structured data

Prepare your data according to the import method that you plan to use. If you plan to ingest media data, also see Structured media data.

You can import structured data from the following sources:

When you import structured data from BigQuery or from Cloud Storage, you are given the option to import the data with metadata. (Structured with metadata is also referred to as enhanced structured data.)

BigQuery

You can import structured data from BigQuery datasets.

Your schema is auto-detected. After importing, Google recommends that you edit the auto-detected schema to map key properties, such as titles. If you import using the API instead of the Google Cloud console, you have the option to provide your own schema as a JSON object. For more information, see Provide or auto-detect a schema.
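
For example, here is a sketch of providing a schema through the API, assuming the SchemaServiceClient and the keyPropertyMapping annotation described in Provide or auto-detect a schema; the property names and IDs are hypothetical.

import json

from google.cloud import discoveryengine_v1 as discoveryengine

client = discoveryengine.SchemaServiceClient()

# Hypothetical schema that marks "title" as the title key property.
json_schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
        "title": {
            "type": "string",
            "keyPropertyMapping": "title",
            "retrievable": True,
        },
    },
}

operation = client.update_schema(
    request=discoveryengine.UpdateSchemaRequest(
        schema=discoveryengine.Schema(
            name=client.schema_path(
                "PROJECT_ID", "global", "DATA_STORE_ID", "default_schema"
            ),
            json_schema=json.dumps(json_schema),
        )
    )
)
operation.result()  # schema updates are long-running operations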

For examples of publicly available structured data, see the BigQuery public datasets.

If you plan to include embeddings in your structured data, see Use custom embeddings.

If you choose to import structured data with metadata, you include two fields in your BigQuery tables:

  • An id field to identify the document. If you import structured data without metadata, then the id is generated for you. Including metadata lets you specify the value of id.

  • A jsonData field that contains the data. For examples of jsonData strings, see the preceding section Cloud Storage.

Use the following BigQuery schema for structured data with metadata imports:

[
  {
    "name": "id",
    "mode": "REQUIRED",
    "type": "STRING",
    "fields": []
  },
  {
    "name": "jsonData",
    "mode": "NULLABLE",
    "type": "STRING",
    "fields": []
  }
]
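
A brief sketch of loading such rows with the google-cloud-bigquery client follows; the table path and the hotel document are placeholders that echo the NDJSON example later in this section.

import json

from google.cloud import bigquery

client = bigquery.Client()

rows = [
    {
        "id": "10001",
        # jsonData is a STRING column, so the document itself is
        # serialized with json.dumps before it is inserted.
        "jsonData": json.dumps(
            {"title": "Hotel 1", "rating": 3.7, "non_smoking": True}
        ),
    },
]

# Placeholder table path: PROJECT_ID.DATASET_ID.TABLE_ID
errors = client.insert_rows_json("PROJECT_ID.my_dataset.structured_docs", rows)
assert not errors, errors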

For instructions on creating your data store, see Create a search data store or Create a recommendations data store.

Notes:

Cloud Storage

Structured data in Cloud Storage must be in either JSON Lines or NDJSON format. Each file must be less than 2 GB, and each row of the file must be less than 1 MB. You can import up to 1,000 files in a single import request.

For examples of publicly available structured data, refer to the following folders in Cloud Storage, which contain NDJSON files:

  • gs://cloud-samples-data/gen-app-builder/search/kaggle_movies
  • gs://cloud-samples-data/gen-app-builder/search/austin_311

If you plan to include embeddings in your structured data, see Use custom embeddings.

Here is an example of an NDJSON file of structured data. Each line of the file represents a document and is made up of a set of fields.

{"id": 10001, "title": "Hotel 1", "location": {"address": "1600 Amphitheatre Parkway, Mountain View, CA 94043"}, "available_date": "2024-02-10", "non_smoking": true, "rating": 3.7, "room_types": ["Deluxe", "Single", "Suite"]}{"id": 10002, "title": "Hotel 2", "location": {"address": "Manhattan, New York, NY 10001"}, "available_date": "2023-07-10", "non_smoking": false, "rating": 5.0, "room_types": ["Deluxe", "Double", "Suite"]}{"id": 10003, "title": "Hotel 3", "location": {"address": "Moffett Park, Sunnyvale, CA 94089"}, "available_date": "2023-06-24", "non_smoking": true, "rating": 2.5, "room_types": ["Double", "Penthouse", "Suite"]}

To create your data store, see Create a search data store or Create a recommendations data store.

Local JSON data

You can directly upload a JSON document or object using the API.
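
For instance, here is a minimal sketch of creating a single document inline with the google-cloud-discoveryengine Python client; the IDs and the document body are placeholders.

from google.cloud import discoveryengine_v1 as discoveryengine

client = discoveryengine.DocumentServiceClient()

document = discoveryengine.Document(
    # struct_data takes the JSON object directly; alternatively,
    # json_data accepts the document as a JSON string.
    struct_data={"title": "Hotel 1", "rating": 3.7},
)

created = client.create_document(
    request=discoveryengine.CreateDocumentRequest(
        parent=client.branch_path(
            "PROJECT_ID", "global", "DATA_STORE_ID", "default_branch"
        ),
        document=document,
        document_id="10001",  # placeholder document ID
    )
)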

Google recommends providing your own schema as a JSON object for better results. If you don't provide your own schema, the schema is auto-detected. After importing, we recommend that you edit the auto-detected schema to map key properties, such as titles. For more information, see Provide or auto-detect a schema.

If you plan to include embeddings in your structured data, see Use custom embeddings.

To create your data store, see Create a search data store or Create a recommendations data store.

Structured media data

If you plan to ingest structured media data, such as videos, news, or music, review the following:

Healthcare FHIR data

If you plan to ingest FHIR data from Cloud Healthcare API, ensure the following:

  • Location: The source FHIR store must be in a Cloud Healthcare API dataset that's in the us-central1, us, or eu location. For more information, see Create and manage datasets in Cloud Healthcare API.
  • FHIR store type: The source FHIR store must be an R4 store. You can check the versions of your FHIR stores by listing the FHIR stores in your dataset. To create a FHIR R4 store, see Create FHIR stores.
  • Import quota: The source FHIR store must have fewer than 1 million FHIR resources. If there are more than 1 million FHIR resources, the import process stops after this limit is reached. For more information, see Quotas and limits.
  • Supported resources: Review the list of FHIR R4 resources that Vertex AI Search supports. For more information, see Healthcare FHIR R4 data schema reference.
  • Resource references: Ensure that relative resource references are in the format Resource/resourceId. For example, subject.reference must have a value such as Patient/034AB16. For more information on how Cloud Healthcare API supports FHIR resource references, see FHIR resource references.

  • The files referenced in a DocumentReference resource must be PDF, RTF, or image files that are stored in Cloud Storage. The link to the referenced files must be in the content[].attachment.url field of the resource in the standard Cloud Storage path format: gs://BUCKET_NAME/PATH_TO_REFERENCED_FILE.

    The following table lists the file size limits for each file type with different configurations (for more information, see Parse and chunk documents). You can import up to 100,000 files at a time.
