Template-based extraction

You can train a high-performing model with as few as three training and three test documents for fixed-layout use cases. Accelerate development and reduce time to production for templated document types like W9, 1040, ACORD, surveys, and questionnaires.

Dataset configuration

A document dataset is required to train, up-train, or evaluate a processor version. Document AI processors learn from examples, just like humans. The dataset determines how stable the processor's performance is.

Train dataset

To improve the model and its accuracy, train it on a dataset of your own documents. The training dataset is made up of documents with ground truth, which is the correctly labeled data as determined by humans. You need a minimum of three documents to train a new model.

Test dataset

The test dataset is what the model uses to generate an F1 score (accuracy). It is made up of documents with ground truth. To see how often the model is right, the model's predictions (the fields it extracts) are compared with the correct answers in the ground truth. The test dataset should have at least three documents.
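To make the F1 score concrete, the following minimal Python sketch compares predictions with ground truth and derives precision, recall, and F1. The function name `score_fields` and the dictionary-per-document representation are assumptions made for illustration. This is not Document AI's evaluation code, and it uses exact string comparison only.

```python
from typing import Dict, List, Tuple


def score_fields(
    ground_truth: List[Dict[str, str]],
    predictions: List[Dict[str, str]],
) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) over all fields of all test documents.

    Each document is a dict mapping a label name to its value (a
    hypothetical representation). A prediction is a true positive only
    if the label exists in the ground truth with an identical value.
    """
    true_pos = false_pos = false_neg = 0
    for truth, predicted in zip(ground_truth, predictions):
        for label, value in predicted.items():
            if truth.get(label) == value:
                true_pos += 1
            else:
                false_pos += 1  # wrong value, or a label that should be absent
        for label, value in truth.items():
            if predicted.get(label) != value:
                false_neg += 1  # field was missed or extracted incorrectly

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Three test documents, the minimum size for a test dataset.
truth = [
    {"name": "A", "tin": "1"},
    {"name": "B", "tin": "2"},
    {"name": "C", "tin": "3"},
]
predicted = [
    {"name": "A", "tin": "1"},  # both fields correct
    {"name": "B"},              # "tin" missed
    {"name": "C", "tin": "9"},  # "tin" extracted with the wrong value
]
precision, recall, f1 = score_fields(truth, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this sketch a wrong value counts against both precision and recall, which is one common convention for field extraction. With only three test documents, a single wrong field moves the score noticeably, so label the test set as carefully as the training set.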

Before you begin

If not already done, enable:

Template-mode labeling best practices

Proper labeling is one of the most important steps to achieving high accuracy. Template mode has a labeling methodology that differs from other training modes:

  • Draw bounding boxes around the entire area you expect data to be in (per label) within a document, even if the field is empty in the training document you're labeling.
  • You may label empty fields for template-based training. Don't label empty fields for model-based training.

Recommended. Labeling example for template-based training to extract the top section of a 1040.

Not recommended. Labeling example for template-based training to extract the top section of a 1040. This is the labeling technique you should use for model-based training for documents with layout variation across documents.
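The difference between the two labeling styles can be sketched in a few lines of Python. The `Box` class, the `template_label_box` helper, and the coordinates below are hypothetical and exist only to illustrate the idea: in template mode the labeled region is the whole area where a value may appear, not the tight box around whatever text happens to be present.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in normalized page coordinates (0.0 to 1.0)."""

    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def union(self, other: "Box") -> "Box":
        """Smallest box that covers both boxes."""
        return Box(
            min(self.x_min, other.x_min),
            min(self.y_min, other.y_min),
            max(self.x_max, other.x_max),
            max(self.y_max, other.y_max),
        )


def template_label_box(field_area: Box, value_box: Optional[Box] = None) -> Box:
    """Box to draw for a label in template mode (illustrative helper).

    field_area is the full region of the form reserved for the field.
    value_box is the tight box around the text in this document, or
    None when the field is empty. The result always covers the full
    field area, and grows if the text spills outside it.
    """
    if value_box is None:
        return field_area  # empty fields are still labeled in template mode
    return field_area.union(value_box)


name_area = Box(0.10, 0.20, 0.60, 0.25)   # assumed region for a name field
typed_name = Box(0.12, 0.21, 0.30, 0.24)  # tight box around the typed text

print(template_label_box(name_area, typed_name))
print(template_label_box(name_area))  # same region for an empty field
```

For model-based training the tight `typed_name` box would be the label instead, and an empty field would get no label at all.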

Build and evaluate a custom extractor with template mode

  1. Create a custom extractor. Create a processor and define the fields you want to extract following best practices, which is important because it impacts extraction quality.

  2. Set dataset location. Select the default option folder (Google-managed). This might be done automatically shortly after creating the processor.

  3. Navigate to the Build tab and select Import documents with auto-labeling enabled. Adding more documents than the minimum of three typically doesn't improve quality for template-based training. Instead of adding more, focus on labeling a small set very accurately.

    Note: You can experiment by increasing the training set size if you observe template variations in your dataset. Try to include at least three training documents per variation. At least three training documents, three test documents, and three schema labels are required per set.

  4. Extend bounding boxes. The boxes for template mode should look like the preceding examples. Extend the bounding boxes, following the best practices, for the optimal result.

  5. Train model.

    1. Select Train new version.
    2. Name the processor version.
    3. Go to Show advanced options and select the template-based model approach.

    Note: It takes some time for the training to complete.
  6. Evaluation.

    1. Go to Evaluate & test.
    2. Select the version you just trained, then select View Full Evaluation.

    You now see metrics such as F1, precision, and recall for the entire document and each field.

    3. Decide if performance meets your production goals, and if not, reevaluate training and testing sets.

  7. Set a new version as the default.

    1. Navigate to Manage versions.
    2. Open the settings menu for the version, then select Set as default.

    Your model is now deployed, and documents sent to this processor use your custom version. Evaluate the model's performance to check whether it requires further training.
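Before starting a training run, it can help to check the dataset against the minimums mentioned in the steps above. The following Python sketch is one way to do that locally. The function `check_dataset`, the `(variation, split)` tuples, and the message wording are all assumptions for illustration and are not part of any Document AI API. It treats the per-set minimums as required and the per-variation guidance as a recommendation.

```python
from collections import Counter
from typing import Iterable, List, Tuple

MIN_DOCS = 3    # minimum training documents and minimum test documents
MIN_LABELS = 3  # minimum schema labels


def check_dataset(
    documents: Iterable[Tuple[str, str]],
    schema_labels: Iterable[str],
) -> List[str]:
    """Return a list of issues; an empty list means the minimums are met.

    documents holds (variation, split) pairs, where variation is a
    hypothetical name for a template layout and split is "train" or "test".
    """
    docs = list(documents)
    issues = []

    label_count = len(set(schema_labels))
    if label_count < MIN_LABELS:
        issues.append(
            f"required: {label_count} schema labels, need at least {MIN_LABELS}"
        )

    split_counts = Counter(split for _, split in docs)
    for split in ("train", "test"):
        if split_counts[split] < MIN_DOCS:
            issues.append(
                f"required: {split_counts[split]} {split} documents, "
                f"need at least {MIN_DOCS}"
            )

    train_per_variation = Counter(v for v, split in docs if split == "train")
    for variation in sorted({v for v, _ in docs}):
        count = train_per_variation[variation]
        if count < MIN_DOCS:
            issues.append(
                f"recommended: variation {variation!r} has {count} training "
                f"documents, aim for at least {MIN_DOCS}"
            )
    return issues


docs = (
    [("form-2023", "train")] * 3
    + [("form-2023", "test")] * 3
    + [("form-2024", "train")]  # a second layout with a single example
)
for issue in check_dataset(docs, ["name", "tin", "address"]):
    print(issue)
```

In the example the set as a whole meets the required minimums, so the only message is the recommendation to add training documents for the second layout.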

Evaluation reference

The evaluation engine can use either exact matching or fuzzy matching. For an exact match, the extracted value must exactly match the ground truth, or it is counted as a miss.

With fuzzy matching, extractions that have slight differences, such as differences in capitalization, still count as a match. You can change this setting on the Evaluation screen.
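The effect of the two modes can be illustrated with a short Python sketch. The normalization rule used here (ignore capitalization and extra whitespace) is an assumption chosen to mirror the capitalization example above. The exact rules that Document AI applies for fuzzy matching are not described on this page, so treat `fuzzy_match` as a stand-in and not as the service's algorithm.

```python
def exact_match(predicted: str, truth: str) -> bool:
    """The extracted value must be identical to the ground truth."""
    return predicted == truth


def fuzzy_match(predicted: str, truth: str) -> bool:
    """Illustrative fuzzy rule: ignore capitalization and extra whitespace."""

    def normalize(text: str) -> str:
        # casefold() lowercases aggressively; split/join collapses whitespace.
        return " ".join(text.casefold().split())

    return normalize(predicted) == normalize(truth)


pairs = [
    ("John Smith", "John Smith"),   # identical
    ("JOHN SMITH", "John Smith"),   # capitalization differs
    ("John  Smith ", "John Smith"), # whitespace differs
    ("John Smyth", "John Smith"),   # a genuinely different value
]
for predicted, truth in pairs:
    print(
        repr(predicted),
        "exact:", exact_match(predicted, truth),
        "fuzzy:", fuzzy_match(predicted, truth),
    )
```

Under this rule the second and third pairs are misses in exact mode but matches in fuzzy mode, while the misspelled name is a miss in both.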


Auto-labeling with the foundation model

The foundation model can accurately extract fields for a variety of document types, but you can also provide additional training data to improve the accuracy of the model for specific document structures.

Document AI uses the label names you define and previous annotations to make it quicker and easier to label documents at scale with auto-labeling.

  1. After creating a custom processor, go to the Get started tab.
  2. Select Create New Field.

    Note: The label name you give the foundation model can greatly affect model accuracy and performance. Be sure to give each field a descriptive name.

  3. Navigate to the Build tab and then select Import documents.

  4. Select the path of the documents and which set the documents should be imported into. Check the auto-labeling checkbox and select the foundation model.

  5. In the Build tab, select Manage dataset. You should see your imported documents. Select one of your documents.

  6. The predictions from the model are highlighted in purple. Review each label predicted by the model and ensure it's correct. If there are missing fields, add those as well.

    Note: It's important that all fields are as accurate as possible, or model performance is going to be affected. For more details, see the labeling best practices.

  7. Once the document has been reviewed, select Mark as labeled.

  8. The document is now ready to be used by the model. Make sure the document is in either the testing or training set.
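The review in the steps above amounts to comparing what the model predicted with the fields defined in your schema. As a rough sketch of that comparison, the following Python function builds a checklist for one document. The function name `review_checklist`, the dictionary of predictions, and the field names are hypothetical, and the console does not expose the data in this form.

```python
from typing import Dict, Iterable, List


def review_checklist(
    schema_labels: Iterable[str],
    predictions: Dict[str, str],
) -> Dict[str, List[str]]:
    """Group schema fields by the review action they need.

    predictions maps a label name to the text the model auto-labeled
    (a hypothetical representation). Blank predictions are treated as
    if the field had not been predicted at all.
    """
    schema = set(schema_labels)
    predicted = {label for label, value in predictions.items() if value.strip()}
    return {
        # Predicted by the model: confirm the value and the box are correct.
        "verify": sorted(predicted & schema),
        # In the schema but not predicted: add by hand if present in the document.
        "check_for_missing": sorted(schema - predicted),
        # Predicted under a name the schema does not define.
        "not_in_schema": sorted(predicted - schema),
    }


checklist = review_checklist(
    schema_labels=["taxpayer_name", "taxpayer_id", "filing_status"],
    predictions={"taxpayer_name": "Jane Doe", "taxpayer_id": " "},
)
for action, labels in checklist.items():
    print(action, labels)
```

A list like this only tells you where to look. Every predicted field still needs a human check before you select Mark as labeled.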

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2026-02-19 UTC.