A layer for AWS Lambda containing the tesseract C libraries and tesseract executable.
bweigel/aws-lambda-tesseract-layer
AWS Lambda layer containing the tesseract OCR libraries and command-line binary for Lambda Runtimes running on Amazon Linux 1 and 2.
⚠️ The Amazon Linux AMI (Version 1) is being deprecated. Users are advised not to use Lambda runtimes (e.g. Python 3.6) based on this version. Refer also to the AWS Lambda runtime deprecation policy.
- Quickstart
- Ready-to-use binaries
- Build tesseract layer from source using Docker
- Known Issues
- Contributors ❤️
This repo comes with ready-to-use binaries compiled against the AWS Lambda Runtimes (based on Amazon Linux 1 and 2). Example projects in Python 3.6 (and 3.8) using Serverless Framework and CDK are provided:

```bash
## Demo using Serverless Framework and prebuilt layer
cd example/serverless
npm ci
npx sls deploy

## or ..

## Demo using CDK and prebuilt layer
cd example/cdk
npm ci
npx cdk deploy
```
For compiled, ready-to-use binaries that you can put in your layer, see `ready-to-use`, or check out the latest release.
See examples for some ready-to-use examples.
Reference the path to the ready-to-use layer contents in your `serverless.yml`:
```yaml
service: tesseract-ocr-layer

provider:
  name: aws

# define layer
layers:
  tesseractAl2:
    # and path to contents
    path: ready-to-use/amazonlinux-2
    compatibleRuntimes:
      - python3.8

functions:
  tesseract-ocr:
    handler: ...
    runtime: python3.8
    # reference layer in function
    layers:
      - { Ref: TesseractAl2LambdaLayer }
    events:
      - http:
          path: ocr
          method: post
```
Deploy:

```bash
npx sls deploy
```
Reference the path to the layer contents in your constructs:
```typescript
const app = new App();
const stack = new Stack(app, 'tesseract-lambda-ci');

const al2Layer = new lambda.LayerVersion(stack, 'al2-layer', {
  // reference the directory containing the ready-to-use layer
  code: Code.fromAsset(path.resolve(__dirname, './ready-to-use/amazonlinux-2')),
  description: 'AL2 Tesseract Layer',
});

new lambda.Function(stack, 'python38', {
  // reference the source code to your function
  code: lambda.Code.fromAsset(path.resolve(__dirname, 'lambda-handlers')),
  runtime: Runtime.PYTHON_3_8,
  // add tesseract layer to function
  layers: [al2Layer],
  memorySize: 512,
  timeout: Duration.seconds(30),
  handler: 'handler.main',
});
```
You can build layer contents manually with the provided `Dockerfile`s.
Build the layer using your preferred `Dockerfile`:
```bash
## build
docker build -t tesseract-lambda-layer -f [Dockerfile.al1|Dockerfile.al2] .
## run container
export CONTAINER=$(docker run -d tesseract-lambda-layer false)
## copy tesseract files from container to local folder layer
docker cp $CONTAINER:/opt/build-dist layer
## remove Docker container
docker rm $CONTAINER
unset CONTAINER
```
Dockerfile | Base image | Compatible runtimes |
---|---|---|
`Dockerfile.al1` (:warning: deprecated) | Amazon Linux 1 | Python 2.7/3.6/3.7, Ruby 2.5, Java 8 (OpenJDK), Go 1.x, .NET Core 2.1 |
`Dockerfile.al2` | Amazon Linux 2 | Python 3.8, Ruby 2.7, Java 8/11 (Corretto), .NET Core 3.1 |
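The table above can be turned into a small lookup when a build script has to pick the Dockerfile for a given function runtime. This is only a sketch: the helper `dockerfile_for` is hypothetical and not part of this repo, and the keys assume the usual Lambda runtime identifiers (`python3.8` is the one used in the examples here; `java8.al2` is the assumed identifier for Java 8 on Amazon Linux 2).

```python
# Hypothetical lookup derived from the compatibility table above.
# Keys are assumed Lambda runtime identifiers; extend as needed.
DOCKERFILE_BY_RUNTIME = {
    "python2.7": "Dockerfile.al1",
    "python3.6": "Dockerfile.al1",
    "python3.7": "Dockerfile.al1",
    "ruby2.5": "Dockerfile.al1",
    "java8": "Dockerfile.al1",
    "go1.x": "Dockerfile.al1",
    "dotnetcore2.1": "Dockerfile.al1",
    "python3.8": "Dockerfile.al2",
    "ruby2.7": "Dockerfile.al2",
    "java8.al2": "Dockerfile.al2",
    "java11": "Dockerfile.al2",
    "dotnetcore3.1": "Dockerfile.al2",
}


def dockerfile_for(runtime):
    """Return the Dockerfile whose base image matches the given runtime."""
    try:
        return DOCKERFILE_BY_RUNTIME[runtime]
    except KeyError:
        raise ValueError(f"no Dockerfile known for runtime {runtime!r}") from None
```

Failing loudly on an unknown runtime is deliberate: building against the wrong base image is what leads to the stripping problems described under Known Issues.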
By default the build generates the tesseract 4.1.3 (amazonlinux-1) or 5.2.0 (amazonlinux-2) OCR libraries with the fast German, English and osd (orientation and script detection) data files included.
The build process can be modified using different build time arguments (defined as `ARG` in `Dockerfile.al[1|2]`), using the `--build-arg` option of `docker build`.
Build argument | Description | Available versions |
---|---|---|
`TESSERACT_VERSION` | The tesseract OCR engine | https://github.com/tesseract-ocr/tesseract/releases |
`LEPTONICA_VERSION` | Fundamental image processing and analysis library | https://github.com/danbloomberg/leptonica/releases |
`OCR_LANG` | Language to install (in addition to `eng` and `osd`) | https://github.com/tesseract-ocr/tessdata (`<lang>.traineddata`) |
`TESSERACT_DATA_SUFFIX` | Trained LSTM models for tesseract. Can be empty (default), `_best` (best inference) or `_fast` (fast inference). | https://github.com/tesseract-ocr/tessdata, https://github.com/tesseract-ocr/tessdata_best, https://github.com/tesseract-ocr/tessdata_fast |
`TESSERACT_DATA_VERSION` | Version of the trained LSTM models for tesseract (as of July 2022, only `4.1.0` is available) | https://github.com/tesseract-ocr/tessdata/releases/tag/4.1.0 |
Example of custom build
```bash
## Build a Docker image based on Amazon Linux 2, with French language support
docker build --build-arg OCR_LANG=fra -t tesseract-lambda-layer-french -f Dockerfile.al2 .
## Build a Docker image based on Amazon Linux 2, with Tesseract 4.0.0 and French language support
docker build --build-arg TESSERACT_VERSION=4.0.0 --build-arg OCR_LANG=fra -t tesseract-lambda-layer -f Dockerfile.al2 .
```
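If you script several custom builds, the `docker build` invocation can be assembled from a dictionary of build arguments. The following is a minimal sketch; `docker_build_command` is a hypothetical helper, not something shipped with this repo, and it only builds the argument list (pass the result to `subprocess.run` to actually run the build).

```python
import shlex


def docker_build_command(dockerfile, tag, build_args=None, context="."):
    """Assemble the argv for `docker build` with optional --build-arg pairs."""
    cmd = ["docker", "build"]
    for name, value in (build_args or {}).items():
        cmd += ["--build-arg", f"{name}={value}"]
    cmd += ["-t", tag, "-f", dockerfile, context]
    return cmd


if __name__ == "__main__":
    # Same build as the French-language example above.
    cmd = docker_build_command(
        "Dockerfile.al2",
        "tesseract-lambda-layer-french",
        {"OCR_LANG": "fra"},
    )
    print(shlex.join(cmd))
```

Keeping the command as a list avoids shell quoting problems if a build argument ever contains spaces.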
The library files contained in the layer are stripped before deployment to make them more suitable for the Lambda environment. See the `Dockerfile`s:
```dockerfile
RUN ... \
  find ${DIST}/lib -name '*.so*' | xargs strip -s
```
Stripping can cause issues when the build runtime and the Lambda runtime differ (e.g. if building on Amazon Linux 1 and running on Amazon Linux 2).
You can build the layer directly and get the artifacts (like in `ready-to-use`). This is done using AWS CDK with the `bundling` option.
Refer to continous-integration and the corresponding GitHub Workflow for an example.
The layer contents get deployed to `/opt` when used by a function. See here for details. See `ready-to-use` for layer contents for Amazon Linux 1 and Amazon Linux 2 (TODO).
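Because the layer is unpacked under `/opt`, a function can call the tesseract binary as a subprocess. The sketch below is not the handler from the example projects. It assumes the Lambda environment puts `/opt/bin` on `PATH` and `/opt/lib` on the library search path so that plain `tesseract` resolves to the layer's binary, and that the trained data files are found without further configuration. `build_command` and `run_ocr` are hypothetical names.

```python
import subprocess

# Assumption: the layer's binary is reachable via PATH inside Lambda.
TESSERACT_BIN = "tesseract"


def build_command(image_path, lang="eng"):
    """Return the argv that OCRs image_path and writes the text to stdout."""
    # `stdout` as the output base tells the tesseract CLI to print the result.
    return [TESSERACT_BIN, image_path, "stdout", "-l", lang]


def run_ocr(image_path, lang="eng", timeout=25):
    """Run tesseract on a local image file and return the recognized text."""
    result = subprocess.run(
        build_command(image_path, lang),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if tesseract exits non-zero
    )
    return result.stdout
```

The default `timeout` is kept below the 30 second function timeout used in the CDK example, so a hanging OCR run surfaces as a Python exception rather than a Lambda timeout.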
Use the Cloud9 IDE with an Amazon Linux AMI to deploy the example. Alternatively, follow the instructions for getting correct binaries for Lambda using EC2. AWS Lambda runs on Amazon Linux, so the Python binaries must be built for it. This step is not needed for deploying the layer itself; the layer and the example function are deployed separately.
You might run into an issue like this:
```
/var/task/PIL/_imaging.cpython-36m-x86_64-linux-gnu.so: ELF load command address/offset not properly aligned
Unable to import module 'handler': cannot import name '_imaging'
```
The root cause is faulty stripping of the libraries using `strip` here.
Quickfix
You can just disable stripping (comment out the line in the `Dockerfile`) and the libraries (`*.so`) won't be stripped. This also means the library files will be larger and your artifact might exceed Lambda limits.
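To see whether an unstripped build still fits, you can measure the unpacked layer directory before deploying. This is a rough sketch with hypothetical helper names. It assumes the commonly documented quota of 250 MB for the unzipped size of a function plus all of its layers; check the current AWS quotas, and remember that your function code counts against the same budget.

```python
import os

# Assumption: 250 MB unzipped quota for function code plus all layers.
LAMBDA_UNZIPPED_LIMIT_BYTES = 250 * 1024 * 1024


def directory_size(root):
    """Sum the sizes of all regular files below root, in bytes."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so linked library aliases are not counted twice.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def fits_lambda_limit(root, limit=LAMBDA_UNZIPPED_LIMIT_BYTES):
    """Return True if the directory contents stay within the given limit."""
    return directory_size(root) <= limit
```

For example, `fits_lambda_limit("layer")` checks the `layer` folder produced by the manual Docker build.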
A lengthier fix
AWS Lambda Runtimes work on top of Amazon Linux. Depending on the Runtime, AWS Lambda uses Amazon Linux Version 1 or Version 2 under the hood. For example, the Python 3.8 Runtime uses Amazon Linux 2, whereas Python <= 3.7 uses Version 1.
The current Dockerfile runs on top of Amazon Linux Version 1, so artifacts for runtimes running Version 2 will throw the above error. You can try to use a base Docker image for Amazon Linux 2 in these cases:
```dockerfile
FROM lambci/lambda-base-2:build
...
```
or, as @secretshardul suggested
> simple solution: Use AWS cloud9 to deploy example folder. Layer can be deployed from anywhere.
>
> complex solution: Deploy EC2 instance with AMI linux and get correct binaries.
- @secretshardul
- @TheLucasMoore for providing a Dockerfile that builds working binaries for Python 3.8 / Amazon Linux 2