Document AI Workbench - Uptraining

    1. Introduction

    Document AI is a document understanding solution that takes unstructured data, such as documents, emails, and so on, and makes the data easier to understand, analyze, and consume.

    By using uptraining through Document AI Workbench, you can achieve higher document processing accuracy by providing additional labeled examples for Specialized Document Types and creating a new model version.

    In this lab, you will create an Invoice Parser processor, configure the processor for uptraining, label example documents, and uptrain the processor.

    The document dataset used in this lab consists of randomly-generated invoices for a fictional piping company.

    NOTE: Document AI Uptraining is currently in Preview, and the Console UI may change over time, so your environment may look slightly different. If you find any issues with this lab, please report them.

    Prerequisites

    This codelab builds upon content presented in other Document AI Codelabs.

    It is recommended that you complete the following Codelabs before proceeding.

    What you'll learn

    • Configure Uptraining for an Invoice Parser processor.
    • Label Document AI training data using the annotation tool.
    • Train a new model version.
    • Evaluate the accuracy of the new model version.

    What you'll need

    2. Getting set up

    This codelab assumes you have completed the Document AI Setup steps listed in the Introductory Codelab.

    Please complete the following steps before proceeding:

    3. Create a Processor

    You must first create an Invoice Parser processor to use for this lab.

    1. In the console, navigate to the Document AI Overview page.

    docai-uptraining-codelab-01

    2. Click Create Processor, scroll down to Specialized (or type "Invoice Parser" in the search bar), and select Invoice Parser.

    docai-uptraining-codelab-02

    3. Give it the name codelab-invoice-uptraining (or something else you'll remember) and select the closest region on the list.

    docai-uptraining-codelab-03

    4. Click Create to create your processor. You should then see the Processor Overview page.

    docai-uptraining-codelab-04

    4. Create a Dataset

    In order to train our processor, we will have to create a dataset with training and testing data to help the processor identify the entities we want to extract.

    You will need to create a new bucket in Cloud Storage to store the dataset. Note: This should not be the same bucket where your documents are currently stored.

    1. Open Cloud Shell and run the following commands to create a bucket. Alternatively, create a new bucket in the Cloud Console. Save this bucket name; you will need it later.
    export PROJECT_ID=$(gcloud config get-value project)
    gsutil mb -p $PROJECT_ID "gs://${PROJECT_ID}-uptraining-codelab"
    2. Go to the Dataset tab, and click on Create Dataset.

    docai-uptraining-codelab-05

    3. Paste the bucket name from the bucket you created in step 1 into the Destination Path field. (Don't include gs://)

    docai-uptraining-codelab-06

    4. Wait for the dataset to be created. You should then be directed to the Dataset management page.

    docai-uptraining-codelab-07
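
The bucket naming and the "no gs://" rule for the Destination Path field can be sketched in a few lines of Python. This is only an illustration: the helper names and the project ID "my-project" are made up for the example and are not part of Document AI or this lab.

```python
def bucket_name_for(project_id: str, suffix: str = "uptraining-codelab") -> str:
    """Mirror the shell command above: the bucket is named <project-id>-<suffix>."""
    return f"{project_id}-{suffix}"


def destination_path(bucket_uri: str) -> str:
    """Return the value to paste into Destination Path (bucket name without gs://)."""
    prefix = "gs://"
    if bucket_uri.startswith(prefix):
        bucket_uri = bucket_uri[len(prefix):]
    # A trailing slash is harmless in a URI but is not part of the bucket name.
    return bucket_uri.rstrip("/")


# "my-project" is a placeholder project ID (an assumption for this example).
bucket = bucket_name_for("my-project")
print(destination_path(f"gs://{bucket}"))
```

If you created the bucket in the Cloud Console instead, pass its name or URI to destination_path directly.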

    5. Import a Test Document

    Now, let's import a sample invoice PDF into our dataset.

    1. Click on Import Documents.

    docai-uptraining-codelab-08

    2. We have a sample PDF for you to use in this lab. Copy and paste the following path into the Source Path box. Leave the "Data split" as "Unassigned" for now. Click Import.
    cloud-samples-data/documentai/codelabs/uptraining/pdfs

    docai-uptraining-codelab-09

    3. Wait for the document to import. This took less than 1 minute in my tests.

    docai-uptraining-codelab-10

    4. When the import completes, you should see the document in the Dataset management UI. Click on it to enter the labeling console.

    docai-uptraining-codelab-11

    6. Label the Test Document

    Next, we will identify text elements and labels for the entities we would like to extract. These labels will be used to train our model to parse this specific document structure and identify the correct types.

    1. You should now be in the labeling console, which will look something like this.

    docai-uptraining-codelab-12

    2. Click on the "Select Text" tool, then highlight the text "McWilliam Piping International Piping Company" and assign the label supplier_name. You can use the text filter to search for label names.

    docai-uptraining-codelab-13

    3. Highlight the text "14368 Pipeline Ave Chino, CA 91710" and assign the label supplier_address.

    docai-uptraining-codelab-14

    4. Highlight the text "10001" and assign the label invoice_id.

    docai-uptraining-codelab-15

    5. Highlight the text "2020-01-02" and assign the label due_date.

    docai-uptraining-codelab-16

    6. Switch to the "Bounding Box" tool. Highlight the text "Knuckle Couplers" and assign the label line_item/description.

    docai-uptraining-codelab-17

    7. Highlight the text "9" and assign the label line_item/quantity.

    docai-uptraining-codelab-18

    8. Highlight the text "74.43" and assign the label line_item/unit_price.

    docai-uptraining-codelab-19

    9. Highlight the text "669.87" and assign the label line_item/amount.

    docai-uptraining-codelab-20

    10. Repeat the previous 4 steps for the next two line items. It should look like this when complete.

    docai-uptraining-codelab-21

    11. Highlight the text "1,419.57" (next to Subtotal) and assign the label net_amount.

    docai-uptraining-codelab-22

    12. Highlight the text "113.57" (next to Tax) and assign the label total_tax_amount.

    docai-uptraining-codelab-23

    13. Highlight the text "1,533.14" (next to Total) and assign the label total_amount.

    docai-uptraining-codelab-24

    14. Highlight one of the "$" characters and assign the label currency.

    docai-uptraining-codelab-25

    15. The labeled document should look like this when complete. Note that you can make adjustments to these labels by clicking on the bounding box in the document or the label name/value on the left side menu. Click Save when you are finished labeling.

    docai-uptraining-codelab-26

    16. Here is the full list of labels and values:

    Label Name             Text
    ---------------------  ---------------------------------------------
    supplier_name          McWilliam Piping International Piping Company
    supplier_address       14368 Pipeline Ave Chino, CA 91710
    invoice_id             10001
    due_date               2020-01-02
    line_item/description  Knuckle Couplers
    line_item/quantity     9
    line_item/unit_price   74.43
    line_item/amount       669.87
    line_item/description  PVC Pipe 12 Inch
    line_item/quantity     7
    line_item/unit_price   15.90
    line_item/amount       111.30
    line_item/description  Copper Pipe
    line_item/quantity     7
    line_item/unit_price   91.20
    line_item/amount       638.40
    net_amount             1,419.57
    total_tax_amount       113.57
    total_amount           1,533.14
    currency               $
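
One way to catch a mislabeled value is to check that the labeled numbers agree with each other. The sketch below is not part of Document AI; it simply re-does the invoice arithmetic on the values listed above (quantity times unit price per line item, line items summing to net_amount, and net_amount plus total_tax_amount equaling total_amount). The function name check_invoice is an assumption made for this example.

```python
from decimal import Decimal

# Values copied from the label list above (thousands separators removed).
LINE_ITEMS = [
    ("Knuckle Couplers", 9, "74.43", "669.87"),
    ("PVC Pipe 12 Inch", 7, "15.90", "111.30"),
    ("Copper Pipe", 7, "91.20", "638.40"),
]
NET_AMOUNT = Decimal("1419.57")
TOTAL_TAX_AMOUNT = Decimal("113.57")
TOTAL_AMOUNT = Decimal("1533.14")


def check_invoice(items, net, tax, total):
    """Return a list of human-readable inconsistencies (empty if all checks pass)."""
    problems = []
    subtotal = Decimal("0")
    for description, quantity, unit_price, amount in items:
        expected = quantity * Decimal(unit_price)
        if expected != Decimal(amount):
            problems.append(f"{description}: {quantity} x {unit_price} != {amount}")
        subtotal += Decimal(amount)
    if subtotal != net:
        problems.append(f"line items sum to {subtotal}, but net_amount is {net}")
    if net + tax != total:
        problems.append(f"net + tax is {net + tax}, but total_amount is {total}")
    return problems


print(check_invoice(LINE_ITEMS, NET_AMOUNT, TOTAL_TAX_AMOUNT, TOTAL_AMOUNT))
```

Decimal is used instead of float so that currency values compare exactly.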

    7. Assign Document to Training Set

    You should now be back at the Dataset management console. Notice that the numbers of Labeled and Unlabeled documents have changed, as has the number of active labels.

    docai-uptraining-codelab-27

    1. We need to assign this document to either the "Training" or "Test" set. Click on the document.

    docai-uptraining-codelab-28

    2. Click Assign to Set, then click on Training.

    docai-uptraining-codelab-29

    3. Notice the Data Split numbers have changed.

    docai-uptraining-codelab-30

    8. Import Pre-Labeled Data

    Document AI Uptraining requires a minimum of 10 documents in both the training and test sets, along with 10 instances of each label in each set.

    It's recommended to have at least 50 documents in each set with 50 instances of each label for best performance. More training data generally equates to higher accuracy.
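
The minimums above can be expressed as a small check. This is a rough sketch, assuming you have already tallied document and label counts yourself (for example, from the numbers shown in the Dataset management page); the function name and the sample counts are invented for illustration.

```python
MIN_DOCUMENTS = 10        # minimum documents per set, as stated above
MIN_LABEL_INSTANCES = 10  # minimum instances of each label per set


def check_split(name, document_count, label_counts, active_labels):
    """Return a list of reasons why one set (training or test) is too small."""
    issues = []
    if document_count < MIN_DOCUMENTS:
        issues.append(
            f"{name}: {document_count} documents, need at least {MIN_DOCUMENTS}"
        )
    for label in sorted(active_labels):
        count = label_counts.get(label, 0)
        if count < MIN_LABEL_INSTANCES:
            issues.append(
                f"{name}: label '{label}' has {count} instances, "
                f"need at least {MIN_LABEL_INSTANCES}"
            )
    return issues


# Hypothetical counts for a training set with a single labeled document.
active = {"invoice_id", "total_amount"}
print(check_split("training", 1, {"invoice_id": 1, "total_amount": 1}, active))
```

Raising the two constants to 50 gives the same check against the recommended sizes.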

    It will take a long time to manually label 100 documents, so we have some pre-labeled documents that you can import for this lab.

    You can import pre-labeled document files in the Document.json format. These can be results from calling a processor and verifying the accuracy using Human in the Loop (HITL).

    NOTE: When importing pre-labeled data, it is highly recommended to manually review annotations before a model is trained.
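
As a first pass before a manual review, you could tally which labels a pre-labeled file contains. The sketch below works on a Document already loaded as a Python dict (for instance with json.load). The field names "entities", "type", "mentionText", and "properties" reflect the JSON form of the Document message as I understand it, so treat them as assumptions and verify them against one of your own files. The sample dict is hand-written and heavily trimmed.

```python
from collections import Counter


def count_labels(document: dict) -> Counter:
    """Count leaf entity types; nested entities (e.g. line items) count their children."""
    counts = Counter()
    stack = list(document.get("entities", []))
    while stack:
        entity = stack.pop()
        children = entity.get("properties", [])
        if children:
            # A parent such as "line_item" is not counted itself, only its parts.
            stack.extend(children)
        else:
            counts[entity.get("type", "<missing type>")] += 1
    return counts


# Hand-written, trimmed example (not a real export).
sample = {
    "entities": [
        {"type": "invoice_id", "mentionText": "10001"},
        {
            "type": "line_item",
            "properties": [
                {"type": "line_item/description", "mentionText": "Knuckle Couplers"},
                {"type": "line_item/quantity", "mentionText": "9"},
            ],
        },
    ]
}
print(dict(count_labels(sample)))
```

Summing these counters across all files in a set gives per-label totals you can compare with the minimum instance counts.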

    1. Click on Import Documents.

    docai-uptraining-codelab-30

    2. Copy and paste the following Cloud Storage path and assign it to the Training set.
    cloud-samples-data/documentai/codelabs/uptraining/training
    3. Click on Add Another Bucket. Then copy and paste the following Cloud Storage path and assign it to the Test set.
    cloud-samples-data/documentai/codelabs/uptraining/test

    docai-uptraining-codelab-31

    4. Click Import and wait for the documents to import. This will take longer than last time because there are more documents to process. In my tests, this took about 6 minutes. You can leave this page and return later.

    docai-uptraining-codelab-32

    5. Once complete, you should see the documents in the Dataset management page.

    docai-uptraining-codelab-33

    9. Edit Labels

    The sample documents we are using for this example do not contain every label supported by the Invoice Parser. We will need to mark the labels we are not using as inactive before training. You can also follow similar steps to add a custom label before Uptraining.

    1. Click on Manage Labels in the bottom-left corner.

    docai-uptraining-codelab-33

    2. You should now be in the Label Management console.

    docai-uptraining-codelab-34

    3. Use the checkboxes and the Disable/Enable buttons to mark ONLY the following labels as Enabled.
      • currency
      • due_date
      • invoice_id
      • line_item/amount
      • line_item/description
      • line_item/quantity
      • line_item/unit_price
      • net_amount
      • supplier_address
      • supplier_name
      • total_amount
      • total_tax_amount
    4. The Console should look like this when complete. Click Save when finished.

    docai-uptraining-codelab-35

    5. Click on the Back arrow to return to the Dataset management console. Notice that the labels with 0 instances have been marked as Inactive.

    docai-uptraining-codelab-36
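
The selection in this section amounts to a set difference: everything in the processor's schema that is not in the enabled list gets disabled. The sketch below illustrates that idea only. The schema passed in is a short, illustrative subset that I chose for the example, not the full Invoice Parser schema, and the helper name is hypothetical.

```python
# The twelve labels kept enabled in this lab.
ENABLED_LABELS = {
    "currency", "due_date", "invoice_id",
    "line_item/amount", "line_item/description",
    "line_item/quantity", "line_item/unit_price",
    "net_amount", "supplier_address", "supplier_name",
    "total_amount", "total_tax_amount",
}


def labels_to_disable(schema_labels, enabled=ENABLED_LABELS):
    """Return the schema labels that should be marked as disabled, sorted by name."""
    return sorted(set(schema_labels) - set(enabled))


# Illustrative subset of a schema; the real label list is shown in the console.
schema = sorted(ENABLED_LABELS) + ["invoice_date", "purchase_order", "receiver_name"]
print(labels_to_disable(schema))
```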

    10. Optional: Auto-label newly imported documents

    When importing unlabeled documents for a processor with an existing deployed processor version, you can use Auto-labeling to save time on labeling.

    1. On the Train page, click Import Documents.
    2. Copy and paste the following Cloud Storage path. This directory contains 5 unlabeled invoice PDFs. From the Data split dropdown list, select Training.
      cloud-samples-data/documentai/Custom/Invoices/PDF_Unlabeled
    3. In the Auto-labeling section, select the Import with auto-labeling checkbox.
    4. Select an existing processor version to label the documents.
      • For example: pretrained-invoice-v1.3-2022-07-15
    5. Click Import and wait for the documents to import. You can leave this page and return later.
      • When complete, the documents appear in the Train page in the Auto-labeled section.
    6. You cannot use auto-labeled documents for training or testing without marking them as labeled. Go to the Auto-labeled section to view the auto-labeled documents.
    7. Select the first document to enter the labeling console.
    8. Verify the labels, bounding boxes, and values to ensure they are correct. Label any values that were omitted.
    9. Select Mark as labeled when finished.
    10. Repeat the label verification for each auto-labeled document, then return to the Train page to use the data for training.

    11. Uptrain the Model

    Now, we are ready to begin training our Invoice Parser.

    1. Click Train New Version.

    docai-uptraining-codelab-36

    2. Give your version a name that you'll remember, such as codelab-uptraining-test-1. The Base version is the model version this new version will be built from. If you're using a new processor, the only option should be Google Pretrained Next with Uptraining.

    docai-uptraining-codelab-37

    3. (Optional) You can also select View Label Stats to see metrics about the labels in your dataset.

    docai-uptraining-codelab-38

    4. Click on Start Training to begin the Uptraining process. You should be redirected to the Dataset management page. You can view the training status on the right side. Training will take a few hours to complete. You can leave this page and return later.

    docai-uptraining-codelab-39

    5. If you click on the version name, you will be directed to the Manage Versions page, which shows the Version ID and the current status of the Training Job.

    docai-uptraining-codelab-40
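
The Version ID shown on the Manage Versions page is what identifies this model version if you later call the processor from code. As a hedged sketch, the helper below assembles a full resource name in the projects/.../processorVersions/... layout that, to my understanding, Document AI uses for processor versions. All four IDs in the example are placeholders, and the helper itself is not a library function.

```python
def processor_version_name(project_id, location, processor_id, version_id):
    """Build a processor version resource name from its parts (assumed layout)."""
    parts = {
        "project_id": project_id,
        "location": location,
        "processor_id": processor_id,
        "version_id": version_id,
    }
    for key, value in parts.items():
        # Empty parts or embedded slashes would silently produce a wrong name.
        if not value or "/" in value:
            raise ValueError(f"invalid {key}: {value!r}")
    return (
        f"projects/{project_id}/locations/{location}"
        f"/processors/{processor_id}/processorVersions/{version_id}"
    )


# Placeholder IDs; copy the real ones from the console.
print(processor_version_name("my-project", "us", "abc123", "def456"))
```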

    12. Test the New Model Version

    Once the Training Job is complete (it took about 1 hour in my tests), you can test the new model version and start using it for predictions.

    1. Go to the Manage Versions page. Here you can see the current status and F1 Score.

    docai-uptraining-codelab-41

    2. We will need to deploy this model version before it can be used. Click on the vertical dots on the right side and select Deploy Version.

    docai-uptraining-codelab-42

    3. Select Deploy from the pop-up window, then wait for the version to deploy. This will take a few minutes to complete. After it's deployed, you can also set this version as the Default Version.

    docai-uptraining-codelab-43

    4. Once it's finished deploying, go to the Evaluate tab. Then click on the Version dropdown and select our newly created version.

    docai-uptraining-codelab-44

    5. On this page, you can view evaluation metrics including the F1 score, Precision, and Recall for the full document as well as individual labels. You can read more about these metrics in the AutoML Documentation.
    6. Download the PDF File linked below. This is a sample document that was not included in the Training or Test set.

    7. Click on Upload Test Document and select the PDF file.

    docai-uptraining-codelab-45

    8. The extracted entities should look something like this.

    docai-uptraining-codelab-46
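
For readers unfamiliar with the metrics on the Evaluate tab, here is a short sketch of how precision, recall, and F1 relate, using the standard definitions (F1 is the harmonic mean of precision and recall). The counts in the example are invented; they are not results from this lab, and the console may aggregate its scores differently.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Standard definitions; returns 0.0 where a ratio would divide by zero."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up counts: 90 correct extractions, 10 spurious, 30 missed.
p, r = precision_recall(90, 10, 30)
print(round(p, 3), round(r, 3), round(f1_score(p, r), 3))
```

Because F1 is a harmonic mean, it stays low if either precision or recall is low, which is why a single label with poor recall can pull its F1 down sharply.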

    13. Conclusion

    Congratulations, you've successfully used Document AI to uptrain an Invoice Parser. You can now use this processor to parse invoices just as you would for any Specialized Processor.

    You can refer to the Specialized Processors Codelab to review how to handle the processing response.

    Cleanup

    To avoid incurring charges to your Google Cloud account for the resources used in this tutorial:

    • In the Cloud Console, go to the Manage resources page.
    • In the project list, select your project then click Delete.
    • In the dialog, type the project ID and then click Shut down to delete the project.

    Resources

    License

    This work is licensed under a Creative Commons Attribution 2.0 Generic License.
