Convert images to PDF, extract PDF text and execute commands depending on the content


mantoni/pdfmatch.js


Convert images to PDFs with Tesseract, extract PDF text with pdftotext and execute commands depending on the content. The purpose is to scan paperwork, generate a PDF with text overlay and apply a set of rules to rename and move the PDF.

Usage

The pdfmatch command is used like this:

pdfmatch [options] source.{pdf,jpeg,tif,...} [target.pdf]

  Options:
    --config  Use the given config file
    --delete  Remove source file if match was found and command executed
     --debug  Don't create the PDF or execute commands, but print the text
          -l  Use the given language(s), overrides the configured "lang"

If no config file is specified, pdfmatch will look for a file named pdfmatch.json in the current directory.

If the source file is a PDF, the text is extracted with pdftotext and the configured rules are applied.

If the source file is not a PDF, it is expected to be an image and is converted to a PDF with searchable text using tesseract. If no target file is given, the base name of the image is used for the PDF. In a second step, the text is extracted with pdftotext and the configured rules are applied.
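The two cases above can be sketched as a small dispatch helper. This is only an illustration, not the tool's actual code: the function name planSource, the extension check, and the choice to place the default target next to the image are all assumptions.

```javascript
'use strict';
const path = require('path');

// Hypothetical helper: decide how a source file would be handled.
function planSource(source, target) {
  const ext = path.extname(source);
  if (ext.toLowerCase() === '.pdf') {
    // PDFs skip the OCR step; text is extracted from the file directly.
    return { ocr: false, pdf: source };
  }
  // Anything else is treated as an image and converted first.
  // Assumption: without an explicit target, the PDF is named after the
  // image's base name and placed in the same directory.
  const base = path.basename(source, ext);
  const pdf = target || path.join(path.dirname(source), base + '.pdf');
  return { ocr: true, pdf };
}
```

For example, planSource('letter.tif') would plan an OCR run that writes letter.pdf, while planSource('letter.tif', 'out.pdf') would write out.pdf.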

The configuration file can specify the language(s) to use with tesseract and a set of rules to apply. After the first match, the associated command is executed and processing is stopped. If no match was found, the no-match command is executed.
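The "first match wins, otherwise no-match" behaviour can be sketched as follows. This is a minimal illustration under stated assumptions: selectCommand and literalMatcher are hypothetical names, and the matcher argument is a stand-in for text-match, whose real API is not shown here.

```javascript
'use strict';

// Pick the command of the first rule that matches, else the fallback.
// `matcher(spec, text)` is a hypothetical stand-in for text-match: it
// returns an object of matched properties, or null if nothing matched.
function selectCommand(config, text, matcher) {
  for (const rule of config.rules || []) {
    // "match" may be a single object or an array of objects.
    const specs = [].concat(rule.match || []);
    for (const spec of specs) {
      const props = matcher(spec, text);
      if (props) {
        return { command: rule.command, props };
      }
    }
  }
  const fallback = config['no-match'];
  return fallback ? { command: fallback, props: {} } : null;
}

// Toy matcher for demonstration only: every value in the spec must
// occur literally in the text. It does not understand ${DATE}.
function literalMatcher(spec, text) {
  const ok = Object.values(spec).every((value) => text.includes(value));
  return ok ? { ...spec } : null;
}
```

Rules are tried in order, so more specific rules belong before more general ones.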

Here is an example:

{
  "rules": [{
    "match": [{
      "invoiceDate": "Invoice Date: ${DATE}"
    }, {
      "invoiceDate": "Ausstellungsdatum: ${DATE}"
    }],
    "command": "mv ${file} ${invoiceDate.format('YYYY-MM-DD')}\\ invoice.pdf"
  }],
  "no-match": "mv ${file} ${now.format('YYYY-MM-DD_HHmmss')}.pdf"
}

The configuration properties are:

  • lang: The language(s) to pass to Tesseract
  • rules: An array of rules to run, where each rule is an object with these properties:
    • match: A single match object or an array of match objects, passed to text-match
    • command: The command to execute, after substituting any JavaScript expressions
  • no-match: A default command to execute if no matching rule was found

The command can contain variables in the form ${...} where ... is a JavaScript expression with access to the matched properties. After successful substitution, the command is written to the console and executed using child_process.execSync(command).

These special properties can be accessed in commands:

  • file: The PDF file
  • now: The current date as a moment object
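One way such a substitution could work is sketched below. It is not taken from the pdfmatch source: the substitute function is hypothetical, and the stub object with a format method only stands in for a real moment instance.

```javascript
'use strict';

// Replace each ${expr} in the command by evaluating expr as JavaScript
// with the given properties in scope as local variables.
// Limitation of this sketch: expressions containing "}" are not supported.
function substitute(command, props) {
  const names = Object.keys(props);
  const values = names.map((name) => props[name]);
  return command.replace(/\$\{([^}]+)\}/g, (whole, expr) => {
    // Builds a function whose parameters are the property names.
    const evaluate = new Function(...names, 'return (' + expr + ');');
    return String(evaluate(...values));
  });
}

// Stub standing in for a moment object (assumption for the demo).
const fakeNow = { format: () => '2024-03-01' };
const result = substitute("mv ${file} ${now.format('YYYY-MM-DD')}.pdf", {
  file: 'scan.pdf',
  now: fakeNow
});
console.log(result); // prints "mv scan.pdf 2024-03-01.pdf"
```

Because expressions are evaluated as code and the result is handed to a shell, a configuration file should be treated as trusted input.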

Install

Installing Tesseract with brew:

$ brew install tesseract --with-all-languages

Installing pdftotext (or download from http://www.foolabs.com/xpdf/download.html):

$ brew install Caskroom/cask/pdftotext

Installing this tool:

$ npm install pdfmatch -g

Example setup

My working setup is a ~/Documents/Scans folder containing only my pdfmatch.json configuration. The commands in the rules move the matched files one level up:

{
  "lang": "deu+eng",
  "rules": [{
    "match": {
      "company": "npm, Inc",
      "invoiceDate": "${DATE}"
    },
    "command": "mv ${file} ../${invoiceDate.format('YYYY-MM-DD')}\\ npm.pdf"
  }],
  "no-match": "mv ${file} ../${now.format('YYYY-MM-DD_HHmmss')}.pdf"
}

AppleScript folder action

If you're following the above example setup, there is an AppleScript folder action in ./scripts which allows you to save or drop files in a special folder and have pdfmatch invoked automatically. Follow the instructions in the header comments on how to use it.

API

This module exposes an API if required as a Node module:

  • processText(pdf_file, config, callback): Extracts text from a PDF file and applies rules from the given configuration (see above).
  • processImage(image_file, pdf_file, config, callback): Converts an image to a PDF file and then calls processText with the result.
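A possible way to call processText from a script is sketched below. The callback's arguments are not documented above, so this assumes a Node-style callback with the error first; processTextAsync is a hypothetical wrapper, and the module is passed in as a parameter so the sketch stays self-contained.

```javascript
'use strict';

// Hypothetical Promise wrapper around processText.
// Assumption: the callback receives an error (or null) as first argument.
function processTextAsync(pdfmatch, pdfFile, config) {
  return new Promise((resolve, reject) => {
    pdfmatch.processText(pdfFile, config, (err, result) => {
      if (err) {
        reject(err);
      } else {
        resolve(result);
      }
    });
  });
}

// Usage, assuming pdfmatch is installed and pdfmatch.json is in the
// current directory:
//
//   const config = JSON.parse(require('fs').readFileSync('pdfmatch.json', 'utf8'));
//   processTextAsync(require('pdfmatch'), 'scan.pdf', config)
//     .catch((err) => console.error(err));
```

processImage could be wrapped the same way, with the image file and the PDF file as the first two arguments.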

License

MIT
