Create dataset

A labeled dataset of documents is required to train, up-train, or evaluate a processor version.

This page describes how to create a dataset, import documents, and define a schema. To label the imported documents, see Label documents.

This page assumes you have already created a processor that supports training, up-training, or evaluation. If your processor is supported, you see the Train tab in the Google Cloud console.

Dataset storage options

You can choose between two options to save your dataset:

  • Google-managed storage
  • Custom Cloud Storage location

Unless you have special requirements (for example, to keep documents in a set of CMEK-enabled folders), we recommend the simpler Google-managed storage option. Once created, the dataset storage option cannot be changed for the processor.

The folder or subfolder for a custom Cloud Storage location must start empty and be treated as strictly read-only. Any manual changes to its contents might make the dataset unusable, risking its loss. The Google-managed storage option does not have this risk.

Follow these steps to provision your storage location.

Google-managed storage (recommended)

  1. Display advanced options while creating a new processor.

  2. Keep the default option, Google-managed storage.

  3. Select Create.

  4. Confirm that the dataset is created successfully and that the dataset location is Google-managed location.

Custom storage option

  1. Display advanced options while creating a new processor.

  2. Select I'll specify my own storage location.

  3. Choose a Cloud Storage folder using the input field.

  4. Select Create.

Dataset API operations

This sample shows you how to use the processors.updateDataset method to create a dataset. A dataset resource is a singleton resource in a processor, which means that there is no create resource RPC. Instead, you can use the updateDataset RPC to set the preferences. Document AI provides an option to store the dataset documents in a Cloud Storage bucket you provide or to have them automatically managed by Google.

Before using any of the request data, make the following replacements:

  • LOCATION: Your processor location
  • PROJECT_ID: Your Google Cloud project ID
  • PROCESSOR_ID: The ID of your custom processor
  • GCS_URI: Your Cloud Storage URI where dataset documents are stored

Provided bucket

Follow these steps to create a dataset request with a Cloud Storage bucket you provide.

HTTP method

PATCH https://LOCATION-documentai.googleapis.com/v1beta3/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID/dataset

Request JSON:

{
  "name": "projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID/dataset",
  "gcs_managed_config": {
    "gcs_prefix": {
      "gcs_uri_prefix": "GCS_URI"
    }
  },
  "spanner_indexing_config": {}
}

Google managed

To create a dataset that is Google-managed, update the following information:

HTTP method

PATCH https://LOCATION-documentai.googleapis.com/v1beta3/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID/dataset

Request JSON:

{
  "name": "projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID/dataset",
  "unmanaged_dataset_config": {},
  "spanner_indexing_config": {}
}

To send your request, you can use curl:

Note: The following command assumes that you have logged in to the gcloud CLI with your account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

CURL

curl -X PATCH \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @request.json \
  "https://LOCATION-documentai.googleapis.com/v1beta3/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID/dataset"

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID"
}
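
The dataset update runs as a long-running operation. As a minimal sketch, you can poll it with the standard projects.locations.operations.get endpoint until it completes:

curl -X GET \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://LOCATION-documentai.googleapis.com/v1beta3/projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID"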

Import documents

A newly created dataset is empty. To add documents, select Import Documents, and then select one or more Cloud Storage folders that contain the documents you want to add to your dataset.

Note: Each imported document can contain up to 50 pages.

If your Cloud Storage bucket is in a different Google Cloud project, make sure to grant access so that Document AI is allowed to read files from that location. Specifically, you must grant the Storage Object Viewer role to Document AI's core service agent service-{project-id}@gcp-sa-prod-dai-core.iam.gserviceaccount.com. For more information, see Service agents.
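
For example, a minimal sketch of granting that role with the gcloud CLI (the bucket name is illustrative; substitute your own project ID and bucket):

gcloud storage buckets add-iam-policy-binding gs://your-source-bucket \
  --member="serviceAccount:service-PROJECT_ID@gcp-sa-prod-dai-core.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"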

Warning: Make sure the file names don't contain the following unsupported characters: * ? [ ] %.

Then, choose one of the following assignment options:

  • Training: Assign to the training set.
  • Test: Assign to the test set.
  • Auto-split: Randomly splits documents between the training and test sets.
  • Unassigned: Not used in training or evaluation. You can manually assign documents later.

You can always modify the assignments later.

When you select Import, Document AI imports all of the supported file types as well as JSON Document files into the dataset. For JSON Document files, Document AI imports the document and converts its entities into label instances.

Document AI does not modify the import folder or read from the folder after the import is complete.

Select Activity at the top of the page to open the Activity panel, which lists the files that were successfully imported as well as those that failed to import.

If you already have an existing version of your processor, you can select the Import with auto-labeling checkbox in the Import documents dialog. The documents are auto-labeled using the previous processor when they are imported. You cannot train or up-train on auto-labeled documents, or use them in the test set, without marking them as labeled. After you import auto-labeled documents, manually review and correct them. Then, select Save to save the corrections and mark the document as labeled. You can then assign the documents as appropriate. See Auto-labeling.

Import documents RPC

This sample shows you how to use the dataset.importDocuments method to import documents into the dataset.

Before using any of the request data, make the following replacements:

  • LOCATION: Your processor location
  • PROJECT_ID: Your Google Cloud project ID
  • PROCESSOR_ID: The ID of your custom processor
  • GCS_URI: Your Cloud Storage URI where dataset documents are stored
  • DATASET_TYPE: The dataset split to which you want to add documents. The value must be either DATASET_SPLIT_TRAIN or DATASET_SPLIT_TEST.
  • TRAINING_SPLIT_RATIO: The ratio of documents that you want to automatically assign to the training set.

Train or test dataset

If you want to add documents to either the training or the test set:

HTTP method

POST https://LOCATION-documentai.googleapis.com/v1beta3/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID/dataset/importDocuments

Request JSON:

{
  "batch_documents_import_configs": {
    "dataset_split": "DATASET_TYPE",
    "batch_input_config": {
      "gcs_prefix": {
        "gcs_uri_prefix": "GCS_URI"
      }
    }
  }
}
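
For example, a filled-in request.json that assigns all imported documents to the training set could look like this (the bucket path is illustrative):

{
  "batch_documents_import_configs": {
    "dataset_split": "DATASET_SPLIT_TRAIN",
    "batch_input_config": {
      "gcs_prefix": {
        "gcs_uri_prefix": "gs://my-bucket/training-docs/"
      }
    }
  }
}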

Train & test dataset

If you want to autosplit the documents between the training and test sets:

HTTP method

POST https://LOCATION-documentai.googleapis.com/v1beta3/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID/dataset/importDocuments

Request JSON:

{
  "batch_documents_import_configs": {
    "auto_split_config": {
      "training_split_ratio": TRAINING_SPLIT_RATIO
    },
    "batch_input_config": {
      "gcs_prefix": {
        "gcs_uri_prefix": "GCS_URI"
      }
    }
  }
}

Save the request body in a file named request.json, and execute the following command:

CURL

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @request.json \
  "https://LOCATION-documentai.googleapis.com/v1beta3/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID/dataset/importDocuments"

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID"
}

Tip: You can use ImportDocumentsMetadata to get the status of each document import. We suggest storing all the DocumentId values returned as part of the import metadata; they can be used to get and delete individual documents from the dataset.
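
As a sketch, you can view that import metadata, including per-document statuses, by polling the operation and filtering the response with jq:

curl -s -X GET \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://LOCATION-documentai.googleapis.com/v1beta3/projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID" \
  | jq '.metadata'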

Delete documents RPC

This sample shows you how to use the dataset.batchDeleteDocuments method to delete documents from the dataset.

Before using any of the request data, make the following replacements:

  • LOCATION: Your processor location
  • PROJECT_ID: Your Google Cloud project ID
  • PROCESSOR_ID: The ID of your custom processor
  • DOCUMENT_ID: The document ID blob returned by the ImportDocuments request

Delete documents

HTTP method

POST https://LOCATION-documentai.googleapis.com/v1beta3/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID/dataset/batchDeleteDocuments

Request JSON:

{
  "dataset_documents": {
    "individual_document_ids": {
      "document_ids": [DOCUMENT_ID]
    }
  }
}

Save the request body in a file named request.json, and execute the following command:

CURL

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @request.json \
  "https://LOCATION-documentai.googleapis.com/v1beta3/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID/dataset/batchDeleteDocuments"

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID"
}

Assign documents to training or test set

Under Data split, select documents and assign them to either the training set, the test set, or unassigned.

Best practices for test set

The quality of your test set determines the quality of your evaluation.

The test set should be created at the beginning of the processor development cycle and locked in so that you can track the processor's quality over time.

We recommend at least 100 documents per document type for the test set. It is critical to ensure that the test set is representative of the types of documents that customers are using for the model being developed.

The test set should be representative of production traffic in terms of frequency. For example, if you are processing W2 forms and expect 70% to be for year 2020 and 30% to be for year 2019, then ~70% of the test set should consist of W2 2020 documents. Such a test set composition ensures appropriate importance is given to each document subtype when evaluating the processor's performance. Also, if you are extracting people's names from international forms, make sure that your test set includes forms from all targeted countries.

Best practices for the training set

Any documents that have already been included in the test set shouldn't be included in the training set.

Unlike the test set, the final training set doesn't need to be as strictly representative of the customer usage in terms of document diversity or frequency. Some labels are more difficult to train than others. Thus, you might get better performance by skewing the training set towards those labels.

In the beginning, there isn't a good way to figure out which labels are difficult. You should start with a small, randomly sampled initial training set using the same approach described for the test set. This initial training set should contain roughly 10% of the total number of documents that you plan to annotate. Then, you can iteratively evaluate the processor quality (looking for specific error patterns) and add more training data.

Define processor schema

After you create a dataset, you can define a processor schema either before or after you import documents.

The processor's schema defines the labels, such as name and address, to extract from your documents.

Select Edit Schema and then create, edit, enable, and disable labels as necessary.

Make sure to select Save when you are finished.

Note: For the custom extractor, a maximum of 150 unique labels is supported.

Notes on schema label management:

  • After a schema label is created, its name cannot be edited.

  • A schema label can only be edited or deleted when there are no trained processor versions. Only the data type and occurrence type can be edited.

  • Disabling a label does not affect prediction. When you send a processing request, the processor version extracts all labels that were active at the time of training.

Get data schema

This sample shows you how to use the dataset.getDatasetSchema method to get the current schema. DatasetSchema is a singleton resource, which is automatically created when you create a dataset resource.

Before using any of the request data, make the following replacements:

  • LOCATION: Your processor location
  • PROJECT_ID: Your Google Cloud project ID
  • PROCESSOR_ID: The ID of your custom processor

Get data schema

HTTP method

GET https://LOCATION-documentai.googleapis.com/v1beta3/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID/dataset/datasetSchema

CURL

curl -X GET \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://LOCATION-documentai.googleapis.com/v1beta3/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID/dataset/datasetSchema"

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID/dataset/datasetSchema",
  "documentSchema": {
    "entityTypes": [
      {
        "name": $SCHEMA_NAME,
        "baseTypes": ["document"],
        "properties": [
          {
            "name": $LABEL_NAME,
            "valueType": $VALUE_TYPE,
            "occurrenceType": $OCCURRENCE_TYPE,
            "propertyMetadata": {}
          }
        ],
        "entityTypeMetadata": {}
      }
    ]
  }
}
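
As a quick sketch, you can list just the label names defined in the schema by piping the same request through jq:

curl -s -X GET \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://LOCATION-documentai.googleapis.com/v1beta3/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID/dataset/datasetSchema" \
  | jq '.documentSchema.entityTypes[].properties[].name'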

Update document schema

This sample shows you how to use the dataset.updateDatasetSchema method to update the current schema. This example shows you a command to update the dataset schema to have one label. If you want to add a new label, rather than delete or update existing labels, call getDatasetSchema first and make the appropriate changes in its response.

Before using any of the request data, make the following replacements:

  • LOCATION: Your processor location
  • PROJECT_ID: Your Google Cloud project ID
  • PROCESSOR_ID: The ID of your custom processor
  • LABEL_NAME: The label name that you want to add
  • LABEL_DESCRIPTION: A description of what the label represents
  • DATA_TYPE: The type of the label. You can specify string, number, currency, money, datetime, address, or boolean.
  • OCCURRENCE_TYPE: Describes the number of times this label is expected. Pick an enum value.

Update schema

HTTP method

PATCH https://LOCATION-documentai.googleapis.com/v1beta3/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID/dataset/datasetSchema

Request JSON:

{
  "document_schema": {
    "entityTypes": [
      {
        "name": $SCHEMA_NAME,
        "baseTypes": ["document"],
        "properties": [
          {
            "name": LABEL_NAME,
            "description": LABEL_DESCRIPTION,
            "valueType": DATA_TYPE,
            "occurrenceType": OCCURRENCE_TYPE,
            "propertyMetadata": {}
          }
        ],
        "entityTypeMetadata": {}
      }
    ]
  }
}

Save the request body in a file named request.json, and execute the following command:

CURL

curl -X PATCH \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @request.json \
  "https://LOCATION-documentai.googleapis.com/v1beta3/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID/dataset/datasetSchema"
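
For example, a filled-in request.json that defines a single datetime label might look like this (the schema and label names are illustrative, and OPTIONAL_ONCE is one of the occurrence enum values):

{
  "document_schema": {
    "entityTypes": [
      {
        "name": "custom_extraction_document_type",
        "baseTypes": ["document"],
        "properties": [
          {
            "name": "invoice_date",
            "description": "The date the invoice was issued",
            "valueType": "datetime",
            "occurrenceType": "OPTIONAL_ONCE",
            "propertyMetadata": {}
          }
        ],
        "entityTypeMetadata": {}
      }
    ]
  }
}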

Choose label attributes

Data type

  • Plain text: a string value.
  • Number: a number, either integer or floating point.
  • Money: a monetary value amount. When labeling, don't include the currency symbol.
  • Currency: a currency symbol.
  • Datetime: a date or time value.
    • When the entity is extracted, it is normalized to the ISO 8601 text format.
  • Address: a location address.
    • When the entity is extracted, it is normalized and enriched with EKG.
  • Checkbox: a true or false boolean value.
  • Signature: a true or false value in normalized_value.signature_value that indicates whether a signature is present. It supports the derive method.
    • mention_text contains Detected or an empty "".
    • normalized_value.text contains Detected or an empty "".
    • normalized_value.boolean_value isn't populated.

Note: Schema entities with type Signature can only have a derive method.

Note: The extract and derive methods can only be defined at the leaf field. This is similar to the datetime type.

Method

Occurrence

Choose REQUIRED if an entity is expected to always appear in documents of a given type. Choose OPTIONAL if there is no such expectation.

Choose ONCE if an entity is expected to have one value, even if the same value appears multiple times in the same document. Choose MULTIPLE if an entity is expected to have multiple values.
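
The REQUIRED/OPTIONAL and ONCE/MULTIPLE choices combine into the occurrence enum values used by the schema API (for example, REQUIRED_ONCE or OPTIONAL_MULTIPLE). As a brief sketch with an illustrative label name, a value that must appear exactly once could be declared as:

{
  "name": "invoice_id",
  "valueType": "string",
  "occurrenceType": "REQUIRED_ONCE",
  "propertyMetadata": {}
}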

Parent and child labels

Parent-child labels (also known as tabular entities) are used to label data in a table. For example, consider a table with 3 rows and 4 columns.

You can define such tables using parent-child labels. In this example, the parent label line-item defines a row of the table.

Create a parent label

  1. On the Edit schema page, select Create Label.

  2. Select the This is a parent label checkbox, and enter the other information. The parent label must have an occurrence of either optional_multiple or required_multiple so that it can be repeated to capture all the rows in the table.

  3. Select Save.

The parent label appears on the Edit schema page, with an Add Child Label option next to it.

Create a child label

  1. Next to the parent label on the Edit schema page, select Add Child Label.

  2. Enter the information for the child label.

  3. Select Save.

Repeat for each child label you want to add.

The child labels appear indented under the parent label on the Edit schema page.

Parent-child labels are a preview feature and are only supported for tables. Nesting depth is limited to 1, meaning that child entities cannot contain other child entities.
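
As a hedged sketch of how a parent-child schema can be expressed in the documentSchema format (the type and label names are illustrative, and the exact representation may differ), the parent label becomes its own entity type that the document type references as a repeated property:

{
  "document_schema": {
    "entityTypes": [
      {
        "name": "custom_extraction_document_type",
        "baseTypes": ["document"],
        "properties": [
          {
            "name": "line-item",
            "valueType": "line-item",
            "occurrenceType": "OPTIONAL_MULTIPLE"
          }
        ]
      },
      {
        "name": "line-item",
        "baseTypes": ["object"],
        "properties": [
          { "name": "description", "valueType": "string", "occurrenceType": "OPTIONAL_ONCE" },
          { "name": "amount", "valueType": "money", "occurrenceType": "OPTIONAL_ONCE" }
        ]
      }
    ]
  }
}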

Create schema labels from labeled documents

Automatically create schema labels by importing pre-labeled Document JSON files.

While Document import is in progress, newly added schema labels are added to the Schema Editor. Select Edit Schema to verify or change the new schema labels' data type and occurrence type. Once confirmed, select the schema labels and select Enable.

Sample datasets

To help you get started with Document AI Workbench, datasets are provided in a public Cloud Storage bucket that includes pre-labeled and unlabeled sample Document JSON files of multiple document types.

These can be used for up-training or custom extractors depending on the document type.

  • gs://cloud-samples-data/documentai/Custom/
  • gs://cloud-samples-data/documentai/Custom/1040/
  • gs://cloud-samples-data/documentai/Custom/Invoices/
  • gs://cloud-samples-data/documentai/Custom/Patents/
  • gs://cloud-samples-data/documentai/Custom/Procurement-Splitter/
  • gs://cloud-samples-data/documentai/Custom/W2-redacted/
  • gs://cloud-samples-data/documentai/Custom/W2/
  • gs://cloud-samples-data/documentai/Custom/W9/
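
You can browse the contents of any of these folders with the gcloud CLI before importing, for example:

gcloud storage ls gs://cloud-samples-data/documentai/Custom/W2/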
