Evaluate performance

Document AI generates evaluation metrics, such as precision and recall, to help you determine the predictive performance of your processors.

These evaluation metrics are generated by comparing the entities returned by the processor (the predictions) against the annotations in the test documents. If your processor does not have a test set, you must first create a dataset and label the test documents.

Run an evaluation

An evaluation is automatically run whenever you train or uptrain a processor version.

You can also manually run an evaluation. This is required to generate updated metrics after you've modified the test set, or if you are evaluating a pretrained processor version.

Note: Document AI cannot and does not calculate evaluation metrics for a label if the processor version cannot extract that label (for example, the label was disabled at the time of training) or if the test set does not include annotations for that label. Such labels are not included in aggregated metrics.

Web UI

  1. In the Google Cloud console, go to the Processors page and choose your processor.

    Go to the Processors page

  2. In the Evaluate & Test tab, select the Version of the processor to evaluate, and then click Run new evaluation.

Once complete, the page contains evaluation metrics for all labels and for each individual label.

Python

For more information, see the Document AI Python API reference documentation.

To authenticate to Document AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

from google.api_core.client_options import ClientOptions
from google.cloud import documentai  # type: ignore

# TODO(developer): Uncomment these variables before running the sample.
# project_id = 'YOUR_PROJECT_ID'
# location = 'YOUR_PROCESSOR_LOCATION' # Format is 'us' or 'eu'
# processor_id = 'YOUR_PROCESSOR_ID'
# processor_version_id = 'YOUR_PROCESSOR_VERSION_ID'
# gcs_input_uri = # Format: gs://bucket/directory/


def evaluate_processor_version_sample(
    project_id: str,
    location: str,
    processor_id: str,
    processor_version_id: str,
    gcs_input_uri: str,
) -> None:
    # You must set the api_endpoint if you use a location other than 'us', e.g.:
    opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")

    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # The full resource name of the processor version, e.g.:
    # projects/{project_id}/locations/{location}/processors/{processor_id}/processorVersions/{processor_version_id}
    name = client.processor_version_path(
        project_id, location, processor_id, processor_version_id
    )

    evaluation_documents = documentai.BatchDocumentsInputConfig(
        gcs_prefix=documentai.GcsPrefix(gcs_uri_prefix=gcs_input_uri)
    )

    # NOTE: Alternatively, specify a list of GCS Documents
    #
    # gcs_input_uri = "gs://bucket/directory/file.pdf"
    # input_mime_type = "application/pdf"
    #
    # gcs_document = documentai.GcsDocument(
    #     gcs_uri=gcs_input_uri, mime_type=input_mime_type
    # )
    # gcs_documents = [gcs_document]
    # evaluation_documents = documentai.BatchDocumentsInputConfig(
    #     gcs_documents=documentai.GcsDocuments(documents=gcs_documents)
    # )

    request = documentai.EvaluateProcessorVersionRequest(
        processor_version=name,
        evaluation_documents=evaluation_documents,
    )

    # Make EvaluateProcessorVersion request.
    # Continually polls the operation until it is complete.
    # This could take some time for larger files.
    operation = client.evaluate_processor_version(request=request)

    # Print operation details.
    # Format: projects/PROJECT_NUMBER/locations/LOCATION/operations/OPERATION_ID
    print(f"Waiting for operation {operation.operation.name} to complete...")

    # Wait for the operation to complete.
    response = documentai.EvaluateProcessorVersionResponse(operation.result())

    # After the operation is complete, print the evaluation ID from the
    # operation response.
    print(f"Evaluation Complete: {response.evaluation}")

Get results of an evaluation

Web UI

  1. In the Google Cloud console, go to the Processors page and choose your processor.

    Go to the Processors page

  2. In the Evaluate & Test tab, select the Version of the processor whose evaluation you want to view.

Once complete, the page contains evaluation metrics for all labels and for each individual label.

Python

For more information, see the Document AI Python API reference documentation.

To authenticate to Document AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

from google.api_core.client_options import ClientOptions
from google.cloud import documentai  # type: ignore

# TODO(developer): Uncomment these variables before running the sample.
# project_id = 'YOUR_PROJECT_ID'
# location = 'YOUR_PROCESSOR_LOCATION' # Format is 'us' or 'eu'
# processor_id = 'YOUR_PROCESSOR_ID' # Create processor before running sample
# processor_version_id = 'YOUR_PROCESSOR_VERSION_ID'
# evaluation_id = 'YOUR_EVALUATION_ID'


def get_evaluation_sample(
    project_id: str,
    location: str,
    processor_id: str,
    processor_version_id: str,
    evaluation_id: str,
) -> None:
    # You must set the api_endpoint if you use a location other than 'us', e.g.:
    opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")

    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # The full resource name of the evaluation, e.g.:
    # projects/{project_id}/locations/{location}/processors/{processor_id}/processorVersions/{processor_version_id}/evaluations/{evaluation_id}
    evaluation_name = client.evaluation_path(
        project_id, location, processor_id, processor_version_id, evaluation_id
    )

    # Make GetEvaluation request.
    evaluation = client.get_evaluation(name=evaluation_name)

    create_time = evaluation.create_time
    document_counters = evaluation.document_counters

    # Print the evaluation information.
    # Refer to https://cloud.google.com/document-ai/docs/reference/rest/v1beta3/projects.locations.processors.processorVersions.evaluations
    # for more information on the available evaluation data.
    print(f"Create Time: {create_time}")
    print(f"Input Documents: {document_counters.input_documents_count}")
    print(f"\tInvalid Documents: {document_counters.invalid_documents_count}")
    print(f"\tFailed Documents: {document_counters.failed_documents_count}")
    print(f"\tEvaluated Documents: {document_counters.evaluated_documents_count}")
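The returned Evaluation also contains the computed metrics themselves. As a minimal sketch (the field names below follow the evaluations REST reference linked in the sample's comments; verify them against the current API before relying on them), you could print the aggregate precision, recall, and F1 score at each confidence level:

from google.cloud import documentai  # type: ignore


def print_all_entities_metrics(evaluation: documentai.Evaluation) -> None:
    # `all_entities_metrics` holds metrics aggregated across all labels,
    # reported at multiple confidence levels (assumed field names; see the
    # evaluations REST reference).
    for level in evaluation.all_entities_metrics.confidence_level_metrics:
        metrics = level.metrics
        print(f"Confidence level: {level.confidence_level}")
        print(f"\tPrecision: {metrics.precision}")
        print(f"\tRecall: {metrics.recall}")
        print(f"\tF1 score: {metrics.f1_score}")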

List all evaluations for a processor version

Python

For more information, see the Document AI Python API reference documentation.

To authenticate to Document AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

from google.api_core.client_options import ClientOptions
from google.cloud import documentai  # type: ignore

# TODO(developer): Uncomment these variables before running the sample.
# project_id = 'YOUR_PROJECT_ID'
# location = 'YOUR_PROCESSOR_LOCATION' # Format is 'us' or 'eu'
# processor_id = 'YOUR_PROCESSOR_ID' # Create processor before running sample
# processor_version_id = 'YOUR_PROCESSOR_VERSION_ID'


def list_evaluations_sample(
    project_id: str, location: str, processor_id: str, processor_version_id: str
) -> None:
    # You must set the api_endpoint if you use a location other than 'us', e.g.:
    opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")

    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # The full resource name of the processor version, e.g.:
    # projects/{project_id}/locations/{location}/processors/{processor_id}/processorVersions/{processor_version_id}
    parent = client.processor_version_path(
        project_id, location, processor_id, processor_version_id
    )

    evaluations = client.list_evaluations(parent=parent)

    # Print the evaluation information.
    # Refer to https://cloud.google.com/document-ai/docs/reference/rest/v1beta3/projects.locations.processors.processorVersions.evaluations
    # for more information on the available evaluation data.
    print(f"Evaluations for Processor Version {parent}")

    for evaluation in evaluations:
        print(f"Name: {evaluation.name}")
        print(f"\tCreate Time: {evaluation.create_time}\n")

Evaluation metrics for all labels


Metrics for All labels are computed based on the number of true positives, false positives, and false negatives in the dataset across all labels, and thus are weighted by the number of times each label appears in the dataset. For definitions of these terms, see Evaluation metrics for individual labels. A short sketch of how these values are computed follows the note below.

  • Precision: the proportion of predictions that match the annotations in the test set. Defined as True Positives / (True Positives + False Positives)

  • Recall: the proportion of annotations in the test set that are correctly predicted. Defined as True Positives / (True Positives + False Negatives)

  • F1 score: the harmonic mean of precision and recall, which combines both into a single metric with equal weight. Defined as 2 * (Precision * Recall) / (Precision + Recall)

Note: Document AI does not provide a metric for Accuracy. The accuracy metric, often defined as the proportion of instances that are predicted correctly, is less meaningful because 1) not all labels appear in the test set (for example, optional fields) and 2) there may be multiple values for a single label (for example, line items in an invoice). F1, on the other hand, can be considered roughly equivalent to accuracy, but it more meaningfully accommodates these scenarios.
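To make the arithmetic concrete, here is a minimal standalone sketch (not part of the Document AI API) that applies the three formulas above to raw true positive, false positive, and false negative counts:

def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    # Precision: proportion of predictions that match an annotation.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: proportion of annotations that are correctly predicted.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Example: 80 true positives, 10 false positives, and 20 false negatives
# yield precision 0.889, recall 0.8, and F1 0.842 (rounded).
print(precision_recall_f1(80, 10, 20))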

Evaluation metrics for individual labels


For each label, a true positive is a predicted entity that matches an annotation in the test set, a false positive is a predicted entity that does not match any annotation, and a false negative is an annotation that no predicted entity matches.

Note: False positives and false negatives are not mutually exclusive. For example, if a predicted entity has a corresponding, albeit incorrect, annotation in the test set, then there is one false positive for the predicted entity and one false negative for the annotation. A predicted entity without an annotation is only a false positive. An annotation without an associated prediction is only a false negative.

Tip: If the quality of checkbox extraction is not high enough, consider running the documents through the Form Parser. The Form Parser extracts information that could potentially boost the quality of checkbox extraction. Specifically, the Form Parser can automatically draw bounding boxes around checkboxes, whereas manually drawn bounding boxes might be less consistent. To do this, process your documents using the Form Parser, then import the processed documents and annotate them again.

Confidence threshold

The evaluation logic ignores any predictions with confidence below the specified Confidence Threshold, even if the prediction is correct. Document AI provides a list of False Negatives (Below Threshold), which are the annotations that would have a match if the confidence threshold were set lower.

Document AI automatically computes the optimal threshold, which maximizes the F1 score, and by default sets the confidence threshold to this optimal value.

Note: If there are multiple optimal threshold values (with the same maximal F1 score), the lowest value is chosen.

You can choose your own confidence threshold by moving the slider bar. In general, a higher confidence threshold results in the following (a code sketch follows this list):

  • higher precision, because the predictions are more likely to be correct.
  • lower recall, because there are fewer predictions.
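The following standalone sketch (illustrative only; it is not how Document AI implements evaluation internally) shows how predictions below a confidence threshold are ignored before counting matches, and how an optimal threshold can be picked by maximizing F1, breaking ties toward the lowest value:

from typing import NamedTuple


class Prediction(NamedTuple):
    confidence: float
    correct: bool  # whether the prediction matches an annotation in the test set


def f1_at_threshold(predictions: list[Prediction], total_annotations: int, threshold: float) -> float:
    # Predictions below the threshold are ignored, even if they are correct.
    kept = [p for p in predictions if p.confidence >= threshold]
    tp = sum(p.correct for p in kept)
    fp = len(kept) - tp
    fn = total_annotations - tp  # simplification: each correct prediction matches a distinct annotation
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0


def optimal_threshold(predictions: list[Prediction], total_annotations: int) -> float:
    # Evaluate each distinct confidence as a candidate threshold; keep the
    # lowest candidate that maximizes F1, mirroring the tie-breaking note above.
    candidates = sorted({p.confidence for p in predictions})
    return max(candidates, key=lambda t: (f1_at_threshold(predictions, total_annotations, t), -t))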

Tabular entities

The metrics for a parent label are not calculated by directly averaging the child metrics, but rather by applying the parent's confidence threshold to all of its child labels and aggregating the results.

The optimal threshold for the parent is the confidence threshold value that, when applied to all children, yields the maximum F1 score for the parent.

Matching behavior

A predicted entity matches an annotation if both the entity type (label) and the text value match.

Note that the type and text value are all that is used for matching. Other information, such as text anchors and bounding boxes (with the exception of tabular entities, described below), is not used.

Note: Some Identity processors extract a Portrait entity, which consists of no text and only a bounding box that indicates where a person's portrait is in an identity document. Because it lacks text, portrait labels and entities are not evaluated.

Single- versus multi-occurrence labels

Single-occurrence labels have one value per document (for example, invoice ID) even if that value is annotated multiple times in the same document (for example, the invoice ID appears on every page of the same document). Even if the multiple annotations have different text, they are considered equal. In other words, if a predicted entity matches any of the annotations, it counts as a match. The extra annotations are considered duplicate mentions and don't contribute towards any of the true positive, false positive, or false negative counts.

Multi-occurrence labels can have multiple, different values. Thus, each predicted entity and annotation is considered and matched separately. If a document contains N annotations for a multi-occurrence label, then there can be N matches with the predicted entities. Each predicted entity and annotation is independently counted as a true positive, false positive, or false negative.
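As a simplified illustration (not the service's internal logic), and assuming matching on exact type and text value as described above, the difference in counting could be sketched like this:

def count_single_occurrence(predictions: list[str], annotations: list[str]) -> tuple[int, int, int]:
    # Single-occurrence label: extra annotations are duplicate mentions, so at
    # most one true positive, false positive, and false negative is counted.
    if predictions and annotations and any(p in annotations for p in predictions):
        return 1, 0, 0  # one match: TP, FP, FN
    return 0, (1 if predictions else 0), (1 if annotations else 0)


def count_multi_occurrence(predictions: list[str], annotations: list[str]) -> tuple[int, int, int]:
    # Multi-occurrence label: every prediction and annotation is matched and
    # counted independently.
    remaining = list(annotations)
    tp = 0
    for p in predictions:
        if p in remaining:
            remaining.remove(p)
            tp += 1
    return tp, len(predictions) - tp, len(remaining)  # TP, FP, FN


# Example: two line-item annotations and two predictions, one of which matches.
print(count_multi_occurrence(["$10.00", "$12.00"], ["$10.00", "$15.00"]))  # (1, 1, 1)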

Fuzzy Matching

The Fuzzy Matching toggle lets you tighten or relax some of the matching rules to decrease or increase the number of matches.

For example, without fuzzy matching, the string ABC does not match abc due to capitalization. With fuzzy matching, they match.

When fuzzy matching is enabled, the following rule changes apply (a code sketch of these normalizations follows the note below):

  • Whitespace normalization: removes leading and trailing whitespace and condenses consecutive intermediate whitespace (including newlines) into single spaces.

  • Leading/trailing punctuation removal: removes the following leading and trailing punctuation characters: ! , . : ; - " ? |

  • Case-insensitive matching: converts all characters to lowercase.

  • Money normalization: for labels with the money data type, removes the leading and trailing currency symbols.

Note: You cannot use fuzzy matching on numeric values. For example, 1 and 1.00 don't match, even when fuzzy matching is enabled.
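A rough standalone sketch of these normalizations (an approximation for intuition, not the exact matching implementation; the currency symbols stripped here are an assumption) might look like this:

import re


def fuzzy_normalize(value: str, is_money: bool = False) -> str:
    # Whitespace normalization: trim and collapse runs of whitespace
    # (including newlines) into single spaces.
    value = re.sub(r"\s+", " ", value.strip())
    # Leading/trailing punctuation removal: ! , . : ; - " ? |
    value = value.strip('!,.:;-"?|')
    # Case-insensitive matching: lowercase everything.
    value = value.lower()
    # Money normalization: strip leading and trailing currency symbols for
    # labels with the money data type (symbol set assumed for illustration).
    if is_money:
        value = value.strip("$€£¥ ")
    return value


# "ABC" and "abc." match once normalized; "1" and "1.00" still do not.
assert fuzzy_normalize(" ABC\n") == fuzzy_normalize("abc.")
assert fuzzy_normalize("1") != fuzzy_normalize("1.00")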

Tabular entities

Parent entities and annotations don't have text values and are matched based on the combined bounding boxes of their children. If there is only one predicted parent and one annotated parent, they are automatically matched, regardless of bounding boxes.

Once parents are matched, their children are matched as if they were non-tabular entities. If parents are not matched, Document AI won't attempt to match their children. This means that child entities can be considered incorrect, even with the same text contents, if their parent entities are not matched.

Parent and child entities are a Preview feature and are supported only for tables with one layer of nesting.
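Purely for intuition, here is a sketch of how a parent's combined bounding box could be derived from its children. The overlap test used for matching is a hypothetical placeholder; the documentation does not specify the exact geometric criterion.

from typing import NamedTuple


class Box(NamedTuple):
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def combined_box(child_boxes: list[Box]) -> Box:
    # The parent's box spans the union of its children's boxes.
    return Box(
        min(b.x_min for b in child_boxes),
        min(b.y_min for b in child_boxes),
        max(b.x_max for b in child_boxes),
        max(b.y_max for b in child_boxes),
    )


def boxes_overlap(a: Box, b: Box) -> bool:
    # Hypothetical criterion, used here only to illustrate that parents are
    # matched geometrically rather than by text value.
    return a.x_min < b.x_max and b.x_min < a.x_max and a.y_min < b.y_max and b.y_min < a.y_max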

Export evaluation metrics

  1. In the Google Cloud console, go to the Processors page and choose your processor.

    Go to the Processors page

  2. In the Evaluate & Test tab, click Download Metrics to download the evaluation metrics as a JSON file.

Note: The exported metrics are based on the value of the Confidence threshold slider at the time of export.

Monitoring dashboard

The monitoring dashboard in the Google Cloud console provides a useful way to create your own monitoring visualizations for the different metrics and resources used by Document AI processors.

List of metrics and their fields: Number of successfully processed pages and Total number of docs sent to processing both have the following fields: location, processor_type, processor_id, processor_version_id, status, and processing_type.

Sync processing latency, Batch processing latency per document, and Batch processing operation count have the following fields: location, processor_type, processor_id, processor_version_id, and status.
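If you want to read these metrics programmatically instead of in the dashboard, a sketch using the Cloud Monitoring client library could look like the following. The metric type string is a placeholder; look up the exact Document AI metric names in the Cloud Monitoring metrics list before running it.

import time

from google.cloud import monitoring_v3

# TODO(developer): Replace with a real Document AI metric type from the
# Cloud Monitoring metrics list (placeholder shown here).
METRIC_TYPE = "documentai.googleapis.com/YOUR_METRIC_NAME"


def list_documentai_time_series(project_id: str) -> None:
    client = monitoring_v3.MetricServiceClient()

    # Query the last hour of data points.
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {
            "end_time": {"seconds": now},
            "start_time": {"seconds": now - 3600},
        }
    )

    results = client.list_time_series(
        request={
            "name": f"projects/{project_id}",
            "filter": f'metric.type = "{METRIC_TYPE}"',
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    for series in results:
        # Each series carries the fields listed above (processor_id,
        # processor_version_id, and so on) as metric or resource labels.
        print(series.metric.type, dict(series.metric.labels))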

Create a monitoring view

You can monitor your processors individually and collectively, with a per-processor view and a per-project view. The steps for both are the same.

  1. In the Google Cloud console, in the Document AI section, go to the Processors page.

    Go to the Processors page

  2. Optional: If you want to monitor a specific processor, select it from the list.

  3. Use the navigation pane to select the Monitoring option.

  4. Set the Time range.

  5. Select the checkboxes for the metric and resource labels you want, and then click OK.

    Tip: You can hold the pointer over the trendlines and series legends to highlight them.
  6. Optional: To see more options for each graph, use the Menu.
