Convert images to PDF, extract PDF text and execute commands depending on the content
mantoni/pdfmatch.js
Convert images to PDFs with Tesseract, extract PDF text with `pdftotext` and execute commands depending on the content. The purpose is to scan paperwork, generate a PDF with text overlay and apply a set of rules to rename and move the PDF.
The `pdfmatch` command is used like this:
pdfmatch [options] source.{pdf,jpeg,tif,...} [target.pdf]

Options:
  --config   Use the given config file
  --delete   Remove source file if match was found and command executed
  --debug    Don't create the PDF or execute commands, but print the text
  -l         Use the given language(s), overrides the configured "lang"
If no config file is specified, `pdfmatch` will look for a file named `pdfmatch.json` in the current directory.
If the source file is a PDF, the text is extracted with `pdftotext` and the configured rules are applied.
If the source file is not a PDF, it is expected to be an image and is converted to a PDF with searchable text using `tesseract`. If no target file is given, the base name of the image is used for the PDF. In a second step, the text is extracted with `pdftotext` and the configured rules are applied.
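The two cases above can be sketched as a small planning function. This is only an illustration: `planSteps` is a hypothetical helper, and it assumes the file type is decided by extension and that the generated PDF is placed next to the image, neither of which is confirmed here.

```javascript
const path = require('path');

// Hypothetical helper (not part of pdfmatch): decide which PDF file will be
// processed and which steps are needed for a given source file.
function planSteps(source, target) {
  const ext = path.extname(source).toLowerCase();
  if (ext === '.pdf') {
    // PDFs skip the OCR step; text is extracted directly.
    return { pdf: source, steps: ['pdftotext', 'rules'] };
  }
  // Anything else is treated as an image. Without an explicit target,
  // reuse the image's base name with a .pdf extension (assumed location:
  // the same directory as the image).
  const pdf = target || path.join(
    path.dirname(source),
    path.basename(source, path.extname(source)) + '.pdf'
  );
  return { pdf, steps: ['tesseract', 'pdftotext', 'rules'] };
}
```

For example, `planSteps('scan.jpeg')` yields `scan.pdf` as the PDF name, while `planSteps('scan.jpeg', 'out.pdf')` keeps the explicit target.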
The configuration file can specify the language(s) to use with `tesseract` and a set of rules to apply. After the first match, the associated command is executed and processing is stopped. If no match was found, the `no-match` command is executed.
Here is an example:
{
  "rules": [{
    "match": [{
      "invoiceDate": "Invoice Date: ${DATE}"
    }, {
      "invoiceDate": "Ausstellungsdatum: ${DATE}"
    }],
    "command": "mv ${file} ${invoiceDate.format('YYYY-MM-DD')}\\ invoice.pdf"
  }],
  "no-match": "mv ${file} ${now.format('YYYY-MM-DD_HHmmss')}.pdf"
}
The configuration properties are:
- `lang`: The language(s) to pass to Tesseract
- `rules`: An array of rules to run, where each rule is an object with these properties:
  - `match`: A single match object or an array of match objects, passed to text-match
  - `command`: The command to execute, after substituting any JavaScript expressions
- `no-match`: A default command to execute if no matching rule was found
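As a rough sketch of the "first match wins, otherwise `no-match`" behaviour described above, the selection logic might look like the following. `selectCommand` and `containsAll` are hypothetical names, and the toy matcher is only a stand-in for the real matching done by text-match, whose API is not shown here.

```javascript
// Hypothetical sketch: pick the command for the first rule that matches.
// `matcher(text, pattern)` stands in for the real matching step and must
// return an object of matched properties, or null when nothing matches.
function selectCommand(config, text, matcher) {
  for (const rule of config.rules || []) {
    // `match` may be a single object or an array of alternatives.
    const patterns = Array.isArray(rule.match) ? rule.match : [rule.match];
    for (const pattern of patterns) {
      const props = matcher(text, pattern);
      if (props) {
        // First match wins; later rules are not consulted.
        return { command: rule.command, props };
      }
    }
  }
  // No rule matched: fall back to the "no-match" command, if configured.
  return { command: config['no-match'] || null, props: {} };
}

// Toy matcher for illustration only: every pattern value must appear
// verbatim in the text. It does not understand placeholders like ${DATE}.
function containsAll(text, pattern) {
  const ok = Object.values(pattern).every((needle) => text.includes(needle));
  return ok ? { ...pattern } : null;
}
```

Passing the matcher in as a parameter keeps the selection loop independent of how patterns are actually matched.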
The `command` can contain variables in the form `${...}` where `...` is a JavaScript expression with access to the matched properties. After successful substitution, the command is written to the console and executed using `child_process.execSync(command)`.
These special properties can be accessed in commands:
- `file`: The PDF file
- `now`: The current date as a moment object
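To make the substitution step concrete, here is a minimal sketch of how `${...}` expressions could be evaluated against a set of properties. `substitute` is a hypothetical function, not pdfmatch's own code, and it assumes property names are valid JavaScript identifiers and that expressions contain no `}` character.

```javascript
// Hypothetical sketch: replace each ${...} in `template` with the result of
// evaluating the enclosed JavaScript expression against `props`.
function substitute(template, props) {
  const names = Object.keys(props);
  const values = names.map((name) => props[name]);
  return template.replace(/\$\{([^}]+)\}/g, (_, expr) => {
    // Build a function whose parameters are the property names, so the
    // expression can refer to them directly (e.g. `file`, `now`).
    const evaluate = new Function(...names, `return (${expr});`);
    return String(evaluate(...values));
  });
}

// Stand-in for a moment object; only `format` is provided here.
const exampleNow = { format: () => '2024-01-31' };
const exampleCommand = substitute(
  "mv ${file} ${now.format('YYYY-MM-DD')}.pdf",
  { file: 'scan.pdf', now: exampleNow }
);
// exampleCommand is "mv scan.pdf 2024-01-31.pdf"
```

Because expressions are evaluated as code and the result is run as a shell command, a configuration file should only come from a trusted source, and file names containing spaces need quoting or escaping in the command.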
Installing Tesseract with brew:
$ brew install tesseract --with-all-languages
Installing `pdftotext` (or download from http://www.foolabs.com/xpdf/download.html):
$ brew install Caskroom/cask/pdftotext
Installing this tool:
$ npm install pdfmatch -g
My working setup is a `~/Documents/Scans` folder containing only my `pdfmatch.json` configuration. The commands in the rules move the matched files one level up:
{
  "lang": "deu+eng",
  "rules": [{
    "match": {
      "company": "npm, Inc",
      "invoiceDate": "${DATE}"
    },
    "command": "mv ${file} ../${invoiceDate.format('YYYY-MM-DD')}\\ npm.pdf"
  }],
  "no-match": "mv ${file} ../${now.format('YYYY-MM-DD_HHmmss')}.pdf"
}
If you're following the above example setup, there is an AppleScript folder action in `./scripts` which allows you to save or drop files in a special folder and have `pdfmatch` invoked automatically. Follow the instructions in the header comments on how to use it.
This module exposes an API if `require`d as a node module:
- `processText(pdf_file, config, callback)`: Extracts text from a PDF file and applies rules from the given configuration (see above).
- `processImage(image_file, pdf_file, config, callback)`: Converts an image to a PDF file and then calls `processText` with the result.
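A small sketch of how the two functions might be called from a script follows. `processFile` is a hypothetical wrapper; the module object is passed in as `api` so the example does not depend on the package being installed, and `config` is presumably the parsed contents of a `pdfmatch.json` file. The arguments passed to the callback are not documented here, so the sketch simply forwards it.

```javascript
// Hypothetical wrapper: route a file to the matching API call.
// `api` is assumed to be the object returned by require('pdfmatch').
function processFile(api, source, target, config, callback) {
  if (/\.pdf$/i.test(source)) {
    // Already a PDF: extract text and apply the rules.
    api.processText(source, config, callback);
  } else {
    // Treat everything else as an image: convert it first, after which
    // processText is called with the resulting PDF.
    api.processImage(source, target, config, callback);
  }
}
```

In a real script this could be invoked as `processFile(require('pdfmatch'), 'scan.tif', 'scan.pdf', config, done)`, where `done` is your own callback.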
License: MIT