Label documents
A labeled dataset of documents is required to train, up-train, or evaluate a processor version.
This page describes how to apply labels from your processor schema to imported documents in your dataset.
This page assumes you have already created a processor that supports training, up-training, or evaluation. If your processor is supported, you now see the Train tab in the Google Cloud console. It also assumes you have created a dataset, imported documents, and defined a processor schema.
Name fields for generative AI extraction
The way fields are named influences how accurately fields are extracted using generative AI. We recommend the following best practices when naming fields:
- Name the field with the same language used to describe it in the document: For example, if a document has a field described as Employer Address, then name the field employer_address. Don't use abbreviations such as emplr_addr.
- Spaces are currently not supported in field names: Instead of using spaces, use _. For example, First Name would be named first_name.
- Iterate on names to improve accuracy: Document AI has a limitation that does not allow field names to change. To test different names, use the entity renaming tool to update the old entity's name to a new one in the dataset, import the dataset, enable the new entities in the processor, and disable or delete the existing fields.
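These naming conventions can be applied programmatically when you generate a schema from many field descriptions. The following is a small illustrative helper (the function name and regex approach are our own, not part of Document AI) that converts a field description, as it appears in the document, into a schema-friendly field name:

```python
import re

def to_field_name(description: str) -> str:
    """Convert a human-readable field description into a field name:
    lowercase, with underscores instead of spaces or punctuation."""
    # Replace any run of non-alphanumeric characters with a single underscore.
    name = re.sub(r"[^0-9a-zA-Z]+", "_", description.strip())
    return name.strip("_").lower()

print(to_field_name("Employer Address"))  # employer_address
print(to_field_name("First Name"))        # first_name
```

Note that the helper keeps the full words from the document rather than abbreviating them, in line with the first best practice above.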
Zero-shot and few-shot learning
Models with Gemini have zero-shot and few-shot learning, which can create high-performing models with little to no training data.
Zero-shot learning is a machine learning technique in which a pre-trained model, without any up-training, learns to recognize and classify classes and entities that it hasn't encountered during training.
Few-shot learning is a technique in which a model learns to recognize and classify new classes and entities with only a few training examples per class. It leverages knowledge from models pre-trained on large, well-labeled datasets to improve performance on few-shot tasks.
Few-shot learning becomes more effective when the training dataset is clean and carefully labeled. Typically, this means having at least 10 testing and 10 training examples available for the model to learn from.
Labeling options
Here are your options for labeling documents:
Manual: manually label your documents in the Google Cloud console
Auto-labeling: use an existing processor version to generate labels
Import pre-labeled documents: save time if you already have labeled documents
Manually label in the Google Cloud console
In the Train tab, select a document to open the labeling tool.
From the list of schema labels on the left side of the labeling tool, select the Add symbol, then use the Bounding box tool to highlight entities in the document and assign them to a label.
In the following screenshot, the EMPL_SSN, EMPLR_ID_NUMBER, EMPLR_NAME_ADDRESS, FEDERAL_INCOME_TAX_WH, SS_TAX_WH, SS_WAGES, and WAGES_TIPS_OTHER_COMP fields in the document have been assigned labels.

When you select a checkbox entity with the Bounding box tool, only select the checkbox itself, and not any associated text. Ensure that the checkbox entity shown on the left is either selected or deselected to match what is in the document.

When you label parent-child entities, don't label the parent entities. The parent entities are just containers of the child entities. Only label the child entities. The parent entities are updated automatically.
When you label child entities, label the first child entity and then associate the related child entities with that line. You'll notice this at the second child entity the first time you label such entities. For example, with an invoice, if you label description, it seems like any other entity. However, when you label quantity next, you are prompted to pick the parent.
Repeat this step for each line item by selecting New Parent Entity for each new line item.
Parent-child entities are supported for tables with up to three layers of nesting. Foundation models support three tiers of fields (grandparent, parent, child), so child entities can have one level of children. To learn more about nesting, refer to Three-level nesting.
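To make the parent-child structure concrete, here's a simplified sketch of how a labeled line item might nest child entities under a parent container entity. This follows the general shape of the Document JSON format (type, mentionText, properties), but it's an abbreviated illustration, not the full schema:

```python
# Simplified sketch of a labeled line item: the parent is just a
# container; the children carry the actual labeled values.
line_item = {
    "type": "line_item",
    "properties": [
        {"type": "description", "mentionText": "USB-C cable"},
        {"type": "quantity", "mentionText": "2"},
    ],
}

# Only the children are labeled by hand; the parent is updated automatically.
child_types = [child["type"] for child in line_item["properties"]]
print(child_types)  # ['description', 'quantity']
```

A third level of nesting would appear as a further "properties" list inside one of the child entities, mirroring the grandparent/parent/child tiers described above.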
Quick tables
Labeling a table row by row can be tedious. The quick tables tool can replicate a row's entity structure across the table. Note that this feature only works on horizontally aligned rows.
- First, label the first row as usual.
- Then, hold the pointer over the parent entity representing the row and select Add more rows. The row becomes a template to create more rows.

- Select the rest of the area of the table.

The tool guesses the annotations, and it usually works. For any rows it can't handle, annotate them manually.
Use keyboard shortcuts in console
To see the keyboard shortcuts that are available, select the menu at the upper right of the labeling console. It displays a list of keyboard shortcuts, as shown in the following table.
| Action | Shortcut |
|---|---|
| Zoom in | Alt + = (Option + = on macOS) |
| Zoom out | Alt + - (Option + - on macOS) |
| Zoom to fit | Alt + 0 (Option + 0 on macOS) |
| Scroll to zoom | Alt + Scroll (Option + Scroll on macOS) |
| Panning | Scroll |
| Reversed panning | Shift + Scroll |
| Drag to pan | Space + Mouse drag |
| Undo | Ctrl + Z (Control + Z on macOS) |
| Redo | Ctrl + Shift + Z (Control + Shift + Z on macOS) |
Auto-label
If available, you can use an existing version of your processor to start labeling.
Note: Auto-labeling can populate labels only if the processor version supports that label. The following introduces two methods to initiate the auto-labeling process.

Caution: Schema compliance isn't enforced during auto-labeling. You must label all instances of each entity for training purposes.

Auto-label can be initiated during import. All documents are annotated using the specified processor version.

Auto-label can be initiated after import for documents in the unlabeled or auto-labeled category. All selected documents are annotated using the specified processor version.

You can't train or up-train on auto-labeled documents, or use them in the test set, without marking them as labeled. Manually review and correct the auto-labeled annotations, then select Mark as Labeled to save the corrections. You can then assign the documents as appropriate.
Import pre-labeled documents
You can import JSON Document files. If the entity in the document matches the label in the processor schema, the entity is converted to a label instance by the importer. There are several ways you can get JSON Document files:
- Export a dataset from another processor. See Export dataset.
- Send a processing request to an existing processor.
- Use the import toolkit to convert existing labels from another system, for example, CSV-format labels, to JSON documents.
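As a sketch of what such a conversion might involve, the following groups CSV label rows into minimal Document-style JSON objects. The CSV layout (document, type, mention_text columns) is a hypothetical example of our own, and the output is an abbreviated illustration rather than the full Document schema:

```python
import csv
import io
import json

# Hypothetical CSV layout: one row per labeled entity, with the source
# document name, the label type, and the labeled text.
CSV_DATA = """document,type,mention_text
invoice_001.pdf,invoice_id,INV-1001
invoice_001.pdf,total_amount,$42.50
"""

def csv_to_documents(csv_text: str) -> dict:
    """Group CSV label rows into minimal Document-style JSON objects,
    one per source document."""
    docs = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        doc = docs.setdefault(row["document"], {"entities": []})
        doc["entities"].append(
            {"type": row["type"], "mentionText": row["mention_text"]}
        )
    return docs

docs = csv_to_documents(CSV_DATA)
print(json.dumps(docs["invoice_001.pdf"], indent=2))
```

For the import to produce label instances, each entity type emitted here would need to match a label already defined in the processor schema.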
Best practices for labeling documents
Consistent labeling is required to train a high-quality processor. We recommend that you:
Create labeling instructions: Your instructions should include examples for both the common and corner cases. Some tips:
- Explain which fields should be annotated and how exactly to make labeling consistent. For example, when labeling "amount", specify whether the currency symbol should be labeled. If the labels are not consistent, then processor quality is reduced.
- Label all occurrences of an entity, even if the label type is REQUIRED_ONCE or OPTIONAL_ONCE. For example, if invoice_id appears two times in the document, label both occurrences.
- Generally, it's preferred to label with the default bounding box tool first. If that fails, then use the select text tool.
- If the value of the label is not correctly detected by OCR, don't manually correct the value. That would render it unusable for training purposes.
- Train annotators: make sure that annotators understand and can follow the guidelines without any systematic errors. One way to achieve this is to have different trainees annotate the same set of documents. The trainer can then check the quality of each trainee's annotation work. You might need to repeat this process until the trainees achieve a benchmark level of accuracy.
- Initial reviews: The first few (10 or so) documents labeled for a use case by a new labeler should be reviewed before large numbers of documents are labeled, to prevent a large number of mistakes that need to be corrected.
- Annotation quality reviews: Given the laborious nature of annotation, even trained annotators may make mistakes. We recommend that annotations are checked by at least one more trained annotator.
Add a description prompt
When adding labels to the schema in custom extractor and custom classifier, you can add a description for the label. This helps train the processor by providing a prompt with which to identify the label. You can try slight variations to test response quality. For example, "total amount", "total invoice amount", or "total amount of invoice".
Tip: These descriptions can be used "reactively". If a label is often mistaken for a similar one, you can mitigate the false positives with descriptions for each label clarifying their differences.

Resync dataset
Resync keeps your dataset's Cloud Storage folder consistent with Document AI's internal index of metadata. This is useful if you've accidentally made changes to the Cloud Storage folder and would like to synchronize the data.
To resync:
In the Processor Details tab, next to the Storage location row, open the menu and then select Re-sync Dataset.

Usage notes:
- If you delete a document from the Cloud Storage folder, resync removes it from the dataset.
- If you add a document to the Cloud Storage folder, resync does not add it to the dataset. To add documents, import them.
- If you modify document labels in the Cloud Storage folder, resync updates the document labels in the dataset.
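The three rules above can be summarized as a tiny model of the resync semantics. This is purely an illustration of the described behavior, not an API:

```python
# Sketch of resync semantics: the dataset keeps only documents still
# present in Cloud Storage, picks up label changes from storage, but
# does NOT pick up documents newly added to the folder.
def resync(dataset: dict, storage: dict) -> dict:
    """Both arguments map document name -> list of labels."""
    return {
        name: storage[name]    # deleted docs drop out; labels refresh
        for name in dataset
        if name in storage     # added docs are NOT imported by resync
    }

dataset = {"a.json": ["old_label"], "b.json": ["x"]}
storage = {"a.json": ["new_label"], "c.json": ["y"]}  # b deleted, c added
print(resync(dataset, storage))  # {'a.json': ['new_label']}
```

In this toy run, b.json is removed (deleted from storage), a.json gets its updated labels, and c.json is ignored because new documents must go through import.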
Migrate dataset
Import and export lets you move all the documents in a dataset from one processor to another. This can be useful if you have processors in different regions or Google Cloud projects, if you have different processors for staging and production, or for general offline consumption.
Note that only the documents and their labels are exported. Dataset metadata, such as processor schema, document assignments (training/test/unassigned), and document labeling status (labeled, unlabeled, auto-labeled), are not exported.
Copying and importing the dataset and then training the target processor is not exactly the same as training the source processor, because random values are used at the beginning of the training process. Use the importProcessorVersion API call to migrate the exact same model between projects. This is the best practice for migrating processors to higher environments (for example, development to staging to production) if policies allow.
Export dataset
To export all documents as JSON Document files to a Cloud Storage folder, select Export Dataset.
A few important things to note:
During export, three sub-folders are created: Test, Train, and Unassigned. Your documents are placed into those sub-folders accordingly.
A document's labeling status is not exported. If you later import the documents, they won't be marked auto-labeled.
If your Cloud Storage is in a different Google Cloud project, make sure to grant access so that Document AI is allowed to write files to that location. Specifically, you must grant the Storage Object Creator role to Document AI's core service agent service-{project-id}@gcp-sa-prod-dai-core.iam.gserviceaccount.com. For more information, see Service agents.
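Assuming you use the gcloud CLI, the grant can look like the following. BUCKET_NAME and PROJECT_ID are placeholders you replace with your own values; this is a sketch of one way to apply the role, not the only way:

```shell
# Grant Document AI's core service agent permission to create objects
# in the export bucket (Storage Object Creator role).
gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
    --member="serviceAccount:service-PROJECT_ID@gcp-sa-prod-dai-core.iam.gserviceaccount.com" \
    --role="roles/storage.objectCreator"
```

You can also apply the same binding through the Google Cloud console's IAM page on the bucket.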
Import dataset
The procedure is the same as Import documents.
Selective labeling user guide
Selective labeling provides recommendations on which documents to label, so you can create diverse training and test datasets to train representative models. Each time selective labeling is performed, the most diverse documents (up to 30) from the dataset are selected.
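Document AI's actual selection algorithm isn't documented here, but the idea of picking a "most diverse" subset can be illustrated with greedy farthest-point sampling over toy feature vectors. Everything in this sketch (the function, the distance metric, the 2-D features) is our own illustration:

```python
# Illustrative sketch: greedily pick up to k documents, each time
# choosing the one farthest (Euclidean) from everything already selected.
def diverse_subset(docs, k):
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    selected = [docs[0]]
    while len(selected) < min(k, len(docs)):
        best = max(
            (d for d in docs if d not in selected),
            key=lambda d: min(dist(d, s) for s in selected),
        )
        selected.append(best)
    return selected

# Toy 2-D "document features": three near-duplicates and one outlier.
docs = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(diverse_subset(docs, 2))  # [(0.0, 0.0), (5.0, 5.0)]
```

The outlier is chosen over the near-duplicates, which mirrors the goal of selective labeling: spending labeling effort on documents that add new information rather than on near-copies.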
Getting suggested documents
Create a CDE processor and import documents.
- At least 100 are required for training (25 for testing).
- Once sufficient documents are imported and after selective labeling, the information bar should appear.


If a CDE processor has zero suggested documents, import more documents so that either split has sufficient documents for sampling.
- This should enable the suggested documents in the Suggested category. You should be able to request suggested documents manually.
- There's a new filter at the top to filter the suggested documents.

Label suggested documents
Go to the Suggested category on the left-hand label list panel. Start labeling these documents.

Select Auto-label on the information bar if the processor is trained. Label the suggested documents.

When you have suggested documents in the processor to navigate to, you can select Review now on the bar. All auto-labeled documents should be reviewed for accuracy. Start reviewing.

Train after labeling all suggested documents
When the suggested documents are labeled, you should see an information bar recommending training. Select Train now on the information bar.

Supported features and limitations
| Feature | Description |
|---|---|
| Support for old processors | Might not work well with old processors with a previously imported dataset |
Except as otherwise noted, the content of this page is licensed under theCreative Commons Attribution 4.0 License, and code samples are licensed under theApache 2.0 License. For details, see theGoogle Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2026-02-19 UTC.