Posted onAug 26, 2023 • Originally published atblog.cloudbuff.in onAug 26, 2023

Automate invoice processing using AWS Textract

A routine task that every business must perform is to process invoices. Often, these invoices are received in paper form or as email attachments. Processing these invoices can - a) be tedious, b) consume a lot of time and c) be error-prone. This makes invoice processing a good use case for automation.

Overview of Textract

The primary challenge in processing invoices is extracting the relevant data. This is whereAmazon Textract can help. It is a service provided by Amazon Web Services (AWS) that uses advanced Machine Learning (ML) algorithms to automatically extract structured and unstructured data from scanned documents, images, and PDF files. It can detect typed and handwritten text in different types of documents including invoices, financial reports, medical records, tax forms etc. Under the hood, it uses Optical Character Recognition (OCR) technology and pre-trained ML models to understand and interpret the content of documents.

Textract Demo

Here is a quick demo to showcase how Textract can be used to extract relevant information from a scanned invoice.

As part of the AWS Free Tier, you can get started with Amazon
Textract for free. The Free Tier lasts for three months, and new > AWS customers can analyze up to 100 invoice pages per month.
Beyond this, there are charges for using Textract.

Let's jump to the AWS console and see how it works. Log into your AWS account and search forTextract in AWS console. ClickAmazon Textract in the search results.

This brings us to the Textract page. For this demo, we will use the US East N. Virginia region.

Click onTry Amazon Textract button. This navigates to theAnalyze Document screen. Since we want to analyze an invoice, click on theAnalyze Expense option in the menu on the left.

By default, this screen shows a sample document and the relevant information extracted by Textract. We can see the image of a Whole Foods receipt on the left and the extracted raw text on the right.

Scroll down and pick theInvoice option. Selecting this option updates the sample to an invoice document and the text extracted from it. We want to use Textract for our sample invoice, so clickChoose document to upload it.

We will upload an image from the local drive (download thisinvoice image).

Textract will take a few seconds to process the invoice once it is uploaded. Soon, we will see our invoice with the extracted text outlined.

On the right, we can see important summary fields like Vendor Name, Vendor Address, Reciever Name, Receiver Address, Invoice #, Invoice Date etc. Clicking onLine item fields shows the invoice line items.

The right panel shows that all the line item details have been correctly extracted. As you can see, Textract can detect and extract tables quite nicely. Another interesting feature is that if you click on any of the text items inResults, it will highlight the portion of the invoice from where the text item was extracted.

ClickDownload results button to download the information extracted from the invoice as a zip file.

The downloaded zip file has three files- two CSV and one JSON file.

FilesummaryFields-1.csv contains the extracted summary information such as the Vendor name, address, phone etc.

FilelineItemFields-1 contains the line item details extracted.

Note that it also shows the confidence scores for extract labels, column titles and values.

FileanalyzeExpenseResponse.json contains a lot more additional details including geometric information (position and size of extracted text) etc.

Conclusion

We saw in this demo that Amazon Textract goes beyond simple OCR to identify, understand, and extract data from forms and tables. The only information we provided was that the document is an invoice. We did not specify any labels, text or positions to extract. The use of pre-trained ML models makes Textract smart enough to automatically identify the appropriate labels and text relevant to an invoice.

In this article, we saw a quick demo of Textract using the AWS console. Textract provides a wide range of CLI and SDKs for Python, Java, NodeJS etc. In thenext article, we will use a Python script to programmatically extract data from the same invoice using Textract SDK.