Train and evaluate
Document AI lets you train new processor versions using your own training data and evaluate the quality of your processor version against your own test data.
This is useful when you want to use a custom processor, or when a Document AI processor already exists for your document type and you want to uptrain a custom version of it to meet your needs.
Training and evaluation are typically performed in tandem to iterate towards a high-quality, usable processor version.
Document AI
Document AI lets you build your own custom extractor, which extracts entities from documents of a particular type, for example, the items in a menu or the name and contact information from a resume.
Unlike other processors, custom processors don't come with any pretrained processor versions and thus cannot process any documents until you train a version from scratch.
To get started with Document AI, see Build your own custom processor.
Uptraining a processor
You can uptrain new processor versions to improve accuracy on your data, extract additional custom fields from your documents, and add support for new languages.
Uptraining works by applying transfer learning to Google's pretrained processor versions and generally requires less data than training from scratch.
To get started, see Uptrain a pretrained processor.
Supported processors
Not all specialized processors support uptraining. These are the processors that support uptraining.
Data considerations and recommendations
The quality and the amount of your data determine the quality of the training, uptraining, and evaluation.
Obtaining a set of representative, real-world documents and providing enough high-quality labels is often the most time-consuming and resource-intensive part of the process.
Number of documents
If your documents all have a similar format (for example, a fixed form with very low variation), then fewer documents are required to achieve good accuracy. The higher the variation, the more documents are required.
The following charts provide a rough estimate of the number of documents that are required for a Custom Document Extractor to achieve a particular quality score.
[Charts: estimated number of documents required to reach a given quality score, one chart for low-variation documents and one for high-variation documents]
Data labeling
Consider your options for labeling documents and make sure you have enough resources to annotate the documents in your dataset.
Training models
Custom extractor processors can use different model types depending on the specific use case and available training data.
- Custom model: model using labeled training data.
  - Template-based: documents with a fixed layout.
  - Model-based: documents with some layout variation.
- Generative AI model: based on pretrained foundation models that require minimal additional training.
The following table illustrates which use cases correspond to each model type.
| | Custom model (template-based) | Custom model (model-based) | Generative AI |
|---|---|---|---|
| Layout variation | None | Low to medium | High |
| Amount of free-form text (for example, paragraphs in a contract) | Low | Low | High |
| Amount of training data required | Low | High | Low |
| Accuracy with limited training data | Higher | Lower | Higher |
Learn to fine-tune a processor with property descriptions.
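The table above can be read as a rough decision rule. The following is a minimal sketch of that reading, not an official selection algorithm: the function name `suggest_model_type`, the string values, and the tie-breaking choices (for example, preferring the generative AI model when variation is low to medium but training data is limited) are assumptions made for illustration.

```python
# Hypothetical helper: one possible interpretation of the model-type table.
VALID_VARIATION = {"none", "low", "medium", "high"}
VALID_FREE_FORM = {"low", "high"}


def suggest_model_type(
    layout_variation: str,
    free_form_text: str,
    has_large_training_set: bool,
) -> str:
    """Return 'template-based', 'model-based', or 'generative-ai'."""
    if layout_variation not in VALID_VARIATION:
        raise ValueError(f"unknown layout variation: {layout_variation!r}")
    if free_form_text not in VALID_FREE_FORM:
        raise ValueError(f"unknown free-form text level: {free_form_text!r}")

    # High layout variation or a lot of free-form text: generative AI column.
    if layout_variation == "high" or free_form_text == "high":
        return "generative-ai"

    # Fixed layout with little free-form text: template-based column.
    if layout_variation == "none":
        return "template-based"

    # Low to medium variation: the model-based column needs a lot of
    # training data; with limited data the table rates generative AI higher.
    return "model-based" if has_large_training_set else "generative-ai"


if __name__ == "__main__":
    print(suggest_model_type("none", "low", has_large_training_set=False))
    print(suggest_model_type("medium", "low", has_large_training_set=True))
```

Treat the result as a starting point for experimentation; evaluating each candidate on your own test set remains the deciding factor.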
When to use another processor
Here are some instances in which you might want to consider options besides Document AI Workbench, or adapt your workflow.
- Certain text-based input formats (.txt, .html, .docx, .md, and so forth) are not supported by Document AI Workbench. Consider other prebuilt or custom language processing offerings in Google Cloud, such as the Cloud Natural Language API.
- The Custom Document Extractor schema supports up to 150 entity labels. If your business logic requires more than 150 entities in the schema definition, consider training multiple processors, each targeting a subset of entities.
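As an illustration of the last point, the sketch below partitions a long list of entity labels into groups that each fit within the 150-label limit, one group per processor. The function name `split_labels` and the label names are hypothetical, and a real split would likely group related entities together rather than slicing in order.

```python
from typing import List, Sequence

# Limit on entity labels per Custom Document Extractor schema, as stated above.
MAX_LABELS_PER_SCHEMA = 150


def split_labels(
    labels: Sequence[str],
    max_per_processor: int = MAX_LABELS_PER_SCHEMA,
) -> List[List[str]]:
    """Split labels into consecutive groups of at most max_per_processor."""
    if max_per_processor < 1:
        raise ValueError("max_per_processor must be at least 1")
    # Drop duplicate labels while preserving their original order.
    unique = list(dict.fromkeys(labels))
    return [
        unique[i : i + max_per_processor]
        for i in range(0, len(unique), max_per_processor)
    ]


if __name__ == "__main__":
    all_labels = [f"field_{i}" for i in range(320)]  # hypothetical label names
    for index, group in enumerate(split_labels(all_labels), start=1):
        print(f"processor {index}: {len(group)} labels")
```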
How to train a processor
Assuming that you have already created a processor that supports training or uptraining and labeled your dataset, you can train a new processor version from scratch. Or you can uptrain a new processor version based on an existing one.
Train processor version
Web UI
In the Google Cloud console, go to your processor's Train tab.
Click Edit Schema to open the Manage Labels page. Verify the processor's labels.
The labels that are enabled at the time of training determine the entities that your new processor version extracts. If a label is inactive in the schema, the processor version does not extract that label, even if the documents are labeled.
On the Train tab, click View Label Stats and verify your test and training sets. Documents that are auto-labeled, unlabeled, or unassigned are excluded from training and evaluation.
Click Train new version.
The Version Name defines the name field of the processorVersion.
Click Start training and wait for your new processor version to be trained and evaluated.
You can monitor training progress on the Manage Versions tab.

Click the Evaluate & Test tab to see how well your new processor version performed on the test set. For more information, see Evaluate processor version.
Python
For more information, see the Document AI Python API reference documentation.
To authenticate to Document AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
```python
from typing import Optional

from google.api_core.client_options import ClientOptions
from google.cloud import documentai  # type: ignore

# TODO(developer): Uncomment these variables before running the sample.
# project_id = 'YOUR_PROJECT_ID'
# location = 'YOUR_PROCESSOR_LOCATION' # Format is 'us' or 'eu'
# processor_id = 'YOUR_PROCESSOR_ID'
# processor_version_display_name = 'new-processor-version'
# train_data_uri = 'gs://bucket/directory/' # (Optional)
# test_data_uri = 'gs://bucket/directory/' # (Optional)


def train_processor_version_sample(
    project_id: str,
    location: str,
    processor_id: str,
    processor_version_display_name: str,
    train_data_uri: Optional[str] = None,
    test_data_uri: Optional[str] = None,
) -> None:
    # You must set the api_endpoint if you use a location other than 'us', e.g.:
    opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")

    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # The full resource name of the processor
    # e.g. `projects/{project_id}/locations/{location}/processors/{processor_id}`
    parent = client.processor_path(project_id, location, processor_id)

    processor_version = documentai.ProcessorVersion(
        display_name=processor_version_display_name
    )

    # If train/test data is not supplied, the default sets in the Cloud Console will be used
    input_data = documentai.TrainProcessorVersionRequest.InputData(
        training_documents=documentai.BatchDocumentsInputConfig(
            gcs_prefix=documentai.GcsPrefix(gcs_uri_prefix=train_data_uri)
        ),
        test_documents=documentai.BatchDocumentsInputConfig(
            gcs_prefix=documentai.GcsPrefix(gcs_uri_prefix=test_data_uri)
        ),
    )

    request = documentai.TrainProcessorVersionRequest(
        parent=parent,
        processor_version=processor_version,
        input_data=input_data,
    )

    operation = client.train_processor_version(request=request)

    # Print operation details
    print(operation.operation.name)

    # Wait for operation to complete
    response = documentai.TrainProcessorVersionResponse(operation.result())

    metadata = documentai.TrainProcessorVersionMetadata(operation.metadata)

    print(f"New Processor Version:{response.processor_version}")
    print(f"Training Set Validation:{metadata.training_dataset_validation}")
    print(f"Test Set Validation:{metadata.test_dataset_validation}")
```

Deploy and use the processor version
You can deploy and manage your processor versions just like any other processor version. For more information, see Managing processor versions.
Once deployed, you can send a processing request to your custom processor.
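A processing request addresses a specific processor version by its full resource name and is sent to the regional endpoint for the processor's location. The sketch below only builds those two strings with plain Python so that it runs without credentials; the helper names `processor_version_name` and `regional_endpoint` and the example IDs are hypothetical. In the Python client library, `client.processor_version_path(...)` should yield the same resource name, which you would then pass as the `name` of a process request.

```python
def processor_version_name(
    project_id: str,
    location: str,
    processor_id: str,
    processor_version_id: str,
) -> str:
    """Build the full resource name of a processor version."""
    return (
        f"projects/{project_id}/locations/{location}"
        f"/processors/{processor_id}/processorVersions/{processor_version_id}"
    )


def regional_endpoint(location: str) -> str:
    """Build the API endpoint for a location such as 'us' or 'eu'."""
    return f"{location}-documentai.googleapis.com"


if __name__ == "__main__":
    # Placeholder IDs for illustration only.
    print(processor_version_name("my-project", "eu", "abc123", "def456"))
    print(regional_endpoint("eu"))
```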
Disable or delete a processor
If you no longer want to use a processor, you can disable or delete it. If you disable a processor, you can re-enable it. If you delete a processor, you cannot recover it.
In the Document AI panel on the left, click My processors.
Click the vertical dots to the right of the processor name. Click Disable processor or Delete processor.
For more information, see Managing processor versions.
Encryption of training data
Document AI training data is saved in Cloud Storage and can be encrypted with Customer-managed encryption keys if required.
Deletion of training data
After a Document AI training job is completed, all training data saved in Cloud Storage expires after a two-day retention period. Subsequent data deletion activities respect the process described in Data deletion on Google Cloud.
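For planning purposes, the end of the retention period can be estimated from the time a training job completed. This is a minimal sketch that assumes the two-day period starts exactly at job completion; the service determines the actual expiry, and the function name `training_data_expiry` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Retention period stated above. The exact start of the period is an assumption.
TRAINING_DATA_RETENTION = timedelta(days=2)


def training_data_expiry(job_completed_at: datetime) -> datetime:
    """Estimate when training data saved for a completed job expires."""
    if job_completed_at.tzinfo is None:
        raise ValueError("job_completed_at must be timezone-aware")
    return job_completed_at + TRAINING_DATA_RETENTION


if __name__ == "__main__":
    completed = datetime(2026, 2, 19, 12, 0, tzinfo=timezone.utc)
    print(training_data_expiry(completed).isoformat())
```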
Pricing
There is no cost for training or uptraining. You pay for hosting and prediction. For more information, see Document AI Pricing.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2026-02-19 UTC.

