Train and evaluate

Document AI lets you train new processor versions using your own training data and evaluate the quality of your processor version against your own test data.

This is useful when you want to use a custom processor, or when a Document AI processor already exists for your document type but you want to uptrain a custom version of it to meet your needs.

Training and evaluation are typically performed in tandem to iterate toward a high-quality, usable processor version.

Document AI

Document AI lets you build your own custom extractor, which extracts entities from documents of a particular type, for example, the items in a menu or the name and contact information from a resume.

Unlike other processors, custom processors don't come with any pretrained processor versions and thus cannot process any documents until you train a version from scratch.

To get started with Document AI, see Build your own custom processor.

Uptraining a processor

You can uptrain new processor versions to improve accuracy on your data, extract additional custom fields from your documents, and add support for new languages.

Uptraining works by applying transfer learning to Google's pretrained processor versions and generally requires less data than training from scratch.

To get started, see Uptrain a pretrained processor.

Supported processors

Not all specialized processors support uptraining. These are the processors that support uptraining.

    Data considerations and recommendations

    The quality and amount of your data determine the quality of training, uptraining, and evaluation.

    Obtaining a set of representative, real-world documents and providing enough high-quality labels are often the most time-consuming and resource-intensive parts of the process.

    Number of documents

    If your documents all have a similar format (for example, a fixed form with very low variation), then fewer documents are required to reach a given accuracy. The higher the variation, the more documents are required.

    The following charts provide a rough estimate of the number of documents that are required for a Custom Document Extractor to achieve a particular quality score.

    [Charts not reproduced: estimated number of documents required for a given quality score, one chart for low-variation documents and one for high-variation documents.]

    Data labeling

    Consider your options for labeling documents and make sure you have enough resources to annotate the documents in your dataset.

    Training models

    Custom extractor processors can use different model types depending on the specific use case and available training data.

    • Custom model: a model trained on labeled training data.
      • Template-based: for documents with a fixed layout.
      • Model-based: for documents with some layout variation.
    • Generative AI model: based on pretrained foundation models that require minimal additional training.

    The following table illustrates which use cases correspond to each model type.

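    | Criterion                                                        | Custom model (template-based) | Custom model (model-based) | Generative AI |
    |------------------------------------------------------------------|-------------------------------|----------------------------|---------------|
    | Layout variation                                                 | None                          | Low to medium              | High          |
    | Amount of free-form text (for example, paragraphs in a contract) | Low                           | Low                        | High          |
    | Amount of training data required                                 | Low                           | High                       | Low           |
    | Accuracy with limited training data                              | Higher                        | Lower                      | Higher        |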
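
One way to read the table is as a rough decision rule. The following Python sketch encodes that reading. The function name `suggest_model_type`, the level strings, and the returned identifiers are all hypothetical illustrations and are not part of the Document AI API; treat the result as a starting point rather than a recommendation.

```python
def suggest_model_type(layout_variation: str, free_form_text: str) -> str:
    """Map two rows of the table to a model type (illustrative heuristic only).

    layout_variation: "none", "low", "medium", or "high"
    free_form_text:   "low" or "high"
    """
    if layout_variation not in {"none", "low", "medium", "high"}:
        raise ValueError(f"unexpected layout_variation: {layout_variation!r}")
    if free_form_text not in {"low", "high"}:
        raise ValueError(f"unexpected free_form_text: {free_form_text!r}")

    # High layout variation or a lot of free-form text points to generative AI.
    if layout_variation == "high" or free_form_text == "high":
        return "generative-ai"
    # A fixed layout suits a template-based custom model.
    if layout_variation == "none":
        return "custom-template-based"
    # Low to medium variation suits a model-based custom model,
    # which the table notes needs more training data.
    return "custom-model-based"
```

For example, `suggest_model_type("medium", "low")` selects the model-based custom model, which the table pairs with a higher training data requirement.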

    Learn how to fine-tune a processor with property descriptions.

    When to use another processor

    Here are some instances in which you might want to consider options besides Document AI Workbench, or adapt your workflow.

    • Certain text-based input formats (.txt, .html, .docx, .md, and so forth) are not supported by Document AI Workbench. Consider other prebuilt or custom language processing offerings in Google Cloud, such as the Cloud Natural Language API.
    • The Custom Document Extractor schema supports up to 150 entity labels. If your business logic requires more than 150 entities in the schema definition, consider training multiple processors, each targeting a subset of entities.
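
Both constraints can be checked before you build a dataset. The sketch below is a minimal illustration, not a Document AI feature: `is_unsupported_text_format` and `split_labels` are hypothetical helper names, the extension set covers only the four formats named above (the list is described as non-exhaustive), and the split is a plain sequential chunking that ignores any relationships between labels.

```python
from pathlib import Path
from typing import List, Sequence

# Only the extensions named in the text; other formats may also be unsupported.
UNSUPPORTED_TEXT_EXTENSIONS = {".txt", ".html", ".docx", ".md"}

# Schema limit stated in the text.
MAX_ENTITY_LABELS = 150


def is_unsupported_text_format(filename: str) -> bool:
    """Return True if the file extension is one of the named text formats."""
    return Path(filename).suffix.lower() in UNSUPPORTED_TEXT_EXTENSIONS


def split_labels(
    labels: Sequence[str], limit: int = MAX_ENTITY_LABELS
) -> List[List[str]]:
    """Chunk labels into groups of at most `limit`, one group per processor."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return [list(labels[i : i + limit]) for i in range(0, len(labels), limit)]
```

With 320 labels, `split_labels` yields three groups of 150, 150, and 20 labels. In practice you would probably group labels that belong together rather than split them by position.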

    How to train a processor

    Assuming that you have already created a processor that supports training or uptraining and labeled your dataset, you can train a new processor version from scratch or uptrain a new processor version based on an existing one.

    Train processor version

    Web UI

    1. In the Google Cloud console, go to your processor's Train tab.

      Go to the Processors Gallery

    2. Click Edit Schema to open the Manage Labels page. Verify the processor's labels.

      The labels that are enabled at the time of training determine the entities that your new processor version extracts. If a label is inactive in the schema, the processor version does not extract that label, even if the documents are labeled.

    3. On the Train tab, click View Label Stats and verify your test and training sets. Documents that are auto-labeled, unlabeled, or unassigned are excluded from training and evaluation.

    4. Click Train new version.

      The Version Name defines the name field of the processorVersion.


    5. Click Start training and wait for your new processor version to be trained and evaluated.

      You can monitor training progress on the Manage Versions tab.

    6. Click the Evaluate & Test tab to see how well your new processor version performed on the test set. For more information, see Evaluate processor version.
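
Steps 2 and 3 describe two filters that decide what a training run actually uses: only enabled labels are extracted, and only documents that are labeled and assigned to a set are counted. The sketch below models those rules on plain dictionaries so you can reason about them offline. The field names (`labeling`, `split`) and the function names are hypothetical and do not mirror the Document AI dataset format.

```python
from typing import Dict, List


def counts_toward_training(doc: Dict[str, str]) -> bool:
    """Return True if a document is used for training or evaluation.

    Auto-labeled, unlabeled, and unassigned documents are excluded.
    """
    return doc["labeling"] == "labeled" and doc["split"] in {"training", "test"}


def extracted_labels(schema: Dict[str, bool]) -> List[str]:
    """Return the labels a new version would extract: those enabled in the schema."""
    return sorted(name for name, enabled in schema.items() if enabled)


docs = [
    {"name": "a.pdf", "labeling": "labeled", "split": "training"},
    {"name": "b.pdf", "labeling": "auto-labeled", "split": "training"},
    {"name": "c.pdf", "labeling": "labeled", "split": "unassigned"},
    {"name": "d.pdf", "labeling": "labeled", "split": "test"},
]
usable = [d["name"] for d in docs if counts_toward_training(d)]
labels = extracted_labels({"invoice_id": True, "total": True, "notes": False})
```

Here only `a.pdf` and `d.pdf` count, and the `notes` label would not be extracted even if documents carry annotations for it.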

    Python

    For more information, see the Document AI Python API reference documentation.

    To authenticate to Document AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

    from typing import Optional

    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai  # type: ignore

    # TODO(developer): Uncomment these variables before running the sample.
    # project_id = 'YOUR_PROJECT_ID'
    # location = 'YOUR_PROCESSOR_LOCATION' # Format is 'us' or 'eu'
    # processor_id = 'YOUR_PROCESSOR_ID'
    # processor_version_display_name = 'new-processor-version'
    # train_data_uri = 'gs://bucket/directory/' # (Optional)
    # test_data_uri = 'gs://bucket/directory/' # (Optional)


    def train_processor_version_sample(
        project_id: str,
        location: str,
        processor_id: str,
        processor_version_display_name: str,
        train_data_uri: Optional[str] = None,
        test_data_uri: Optional[str] = None,
    ) -> None:
        # You must set the api_endpoint if you use a location other than 'us', e.g.:
        opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")

        client = documentai.DocumentProcessorServiceClient(client_options=opts)

        # The full resource name of the processor
        # e.g. `projects/{project_id}/locations/{location}/processors/{processor_id}`
        parent = client.processor_path(project_id, location, processor_id)

        processor_version = documentai.ProcessorVersion(
            display_name=processor_version_display_name
        )

        # If train/test data is not supplied, the default sets in the Cloud Console will be used
        input_data = documentai.TrainProcessorVersionRequest.InputData(
            training_documents=documentai.BatchDocumentsInputConfig(
                gcs_prefix=documentai.GcsPrefix(gcs_uri_prefix=train_data_uri)
            ),
            test_documents=documentai.BatchDocumentsInputConfig(
                gcs_prefix=documentai.GcsPrefix(gcs_uri_prefix=test_data_uri)
            ),
        )

        request = documentai.TrainProcessorVersionRequest(
            parent=parent,
            processor_version=processor_version,
            input_data=input_data,
        )

        operation = client.train_processor_version(request=request)
        # Print operation details
        print(operation.operation.name)
        # Wait for operation to complete
        response = documentai.TrainProcessorVersionResponse(operation.result())

        metadata = documentai.TrainProcessorVersionMetadata(operation.metadata)

        print(f"New Processor Version: {response.processor_version}")
        print(f"Training Set Validation: {metadata.training_dataset_validation}")
        print(f"Test Set Validation: {metadata.test_dataset_validation}")
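
The sample derives two strings from its arguments: the regional API endpoint and the processor's full resource name. The sketch below reproduces only those two formats, as they appear in the sample's code and comments, so you can check your values without calling the service. The helper names `api_endpoint` and `processor_resource_name` are hypothetical; in the sample itself the resource name comes from `client.processor_path`.

```python
def api_endpoint(location: str) -> str:
    """Build the regional endpoint the sample passes to ClientOptions."""
    return f"{location}-documentai.googleapis.com"


def processor_resource_name(project_id: str, location: str, processor_id: str) -> str:
    """Build the processor resource name in the format shown in the sample comment."""
    return f"projects/{project_id}/locations/{location}/processors/{processor_id}"


# Placeholder values for illustration only.
endpoint = api_endpoint("eu")
parent = processor_resource_name("my-project", "eu", "abc123")
```

Printing `endpoint` and `parent` before starting a training job is a cheap way to catch a mismatched location, since the endpoint and the resource name must refer to the same region.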

    Deploy and use the processor version

    You can deploy and manage your processor versions just like any other processor version. For more information, see Managing processor versions.

    Once deployed, you can send a processing request to your custom processor.
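
As a rough sketch of what such a request can look like with the Python client library, the following sends one local file to a specific deployed processor version and returns the extracted entities. It assumes the `google-cloud-documentai` package is installed and Application Default Credentials are configured. The function names, the `version_id` parameter, and the default MIME type are illustrative choices, not part of this page's official samples.

```python
from typing import List, Tuple


def processor_version_name(
    project_id: str, location: str, processor_id: str, version_id: str
) -> str:
    """Build the full resource name of a processor version."""
    return (
        f"projects/{project_id}/locations/{location}"
        f"/processors/{processor_id}/processorVersions/{version_id}"
    )


def process_with_version(
    project_id: str,
    location: str,
    processor_id: str,
    version_id: str,
    file_path: str,
    mime_type: str = "application/pdf",
) -> List[Tuple[str, str]]:
    """Process one local file and return (entity type, text) pairs."""
    # Imported here so the name helper works without the client library installed.
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai

    opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")
    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    with open(file_path, "rb") as f:
        raw_document = documentai.RawDocument(content=f.read(), mime_type=mime_type)

    request = documentai.ProcessRequest(
        name=processor_version_name(project_id, location, processor_id, version_id),
        raw_document=raw_document,
    )
    result = client.process_document(request=request)
    return [(entity.type_, entity.mention_text) for entity in result.document.entities]
```

Addressing the version by its full resource name, rather than the processor alone, lets you compare a newly trained version against the current default before you change which version serves traffic.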

    Disable or delete a processor

    If you no longer want to use a processor, you can disable or delete it. If you disable a processor, you can re-enable it. If you delete a processor, you cannot recover it.

    1. In the Document AI panel on the left, click My processors.

    2. Click the vertical dots to the right of the processor name. Click Disable processor or Delete processor.

    For more information, see Managing processor versions.

    Encryption of training data

    Document AI training data is saved in Cloud Storage and can be encrypted with customer-managed encryption keys if required.

    Deletion of training data

    After a Document AI training job is completed, all training data saved in Cloud Storage expires after a two-day retention period. Subsequent data deletion activities respect the process described in Data deletion on Google Cloud.
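
To estimate when the retention period ends for a given job, you can add two days to the job's completion time. The sketch below assumes the retention clock starts at the moment the job completes, which the text implies but does not state precisely; `training_data_expiry` is a hypothetical helper, and actual deletion follows the process referenced above.

```python
from datetime import datetime, timedelta, timezone

# Two-day retention period stated in the text.
RETENTION_PERIOD = timedelta(days=2)


def training_data_expiry(job_completed_at: datetime) -> datetime:
    """Estimate when training data expires, assuming the clock starts at job completion."""
    return job_completed_at + RETENTION_PERIOD


# Example with an arbitrary completion time in UTC.
completed = datetime(2026, 2, 19, 12, 0, tzinfo=timezone.utc)
expires = training_data_expiry(completed)
```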

    Pricing

    There is no cost for training or uptraining. You pay for hosting and prediction. For more information, see Document AI Pricing.


    Last updated 2026-02-19 UTC.