Converts PDF, DOC, DOCX, XML, HTML, RTF, etc to plain text
sajari/docconv
A Go wrapper library to convert PDF, DOC, DOCX, XML, HTML, RTF, ODT, Pages documents and images (see optional dependencies below) to plain text.
If you haven't set up Go before, you first need to install Go.
To fetch and build the code:
$ go install code.sajari.com/docconv/v2/docd@latest
See go help install for details on the installation location of the installed docd executable. Make sure that the full path to the executable is in your PATH environment variable.
- tidy
- wv
- poppler-utils
- unrtf
- https://github.com/JalfResi/justext
$ sudo apt-get install poppler-utils wv unrtf tidy
$ go get github.com/JalfResi/justext
$ brew install poppler-qt5 wv unrtf tidy-html5
$ go get github.com/JalfResi/justext
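Before converting documents it can help to confirm that the external tools are reachable. The following is a minimal sketch, not part of docconv: `missingTools` is a hypothetical helper, and the executable names (`pdftotext` from poppler-utils, `wvText` from wv, `unrtf`, `tidy`) are assumptions about what the packages above install on your system.

```go
package main

import (
	"fmt"
	"os/exec"
)

// missingTools returns the names that cannot be found on PATH.
func missingTools(names []string) []string {
	var missing []string
	for _, name := range names {
		if _, err := exec.LookPath(name); err != nil {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Assumed executable names; adjust them if your packages
	// install differently named binaries.
	tools := []string{"pdftotext", "wvText", "unrtf", "tidy"}
	for _, name := range missingTools(tools) {
		fmt.Println("not found on PATH:", name)
	}
}
```

If the program prints nothing, every listed executable was found.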
To add image support to the docconv library you first need to install and build gosseract.
Now you can add -tags ocr to any go command when building/fetching/testing docconv to include support for processing images:
$ go get -tags ocr code.sajari.com/docconv/v2/...
This may complain on macOS, which you can fix by installing tesseract via brew:
$ brew install tesseract
The docd tool runs as either:
- a service on port 8888 (by default)
  Documents can be sent as a multipart POST request and the plain text (body) and meta information are then returned as a JSON object.
- a service exposed from within a Docker container
  This also runs as a service, but from within a Docker container. Official images are published at https://hub.docker.com/r/sajari/docd.
  Optionally you can build it yourself:
  $ cd docd
  $ docker build -t docd .
- via the command line
  Documents can be sent as an argument, e.g.
  $ docd -input document.pdf
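As a rough illustration of the JSON object returned in service mode, the sketch below decodes a reply into a small struct. The type `convertResponse`, the helper `decodeResponse`, and the field names `body` and `meta` are assumptions inferred from the description above, and the sample reply is made up.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// convertResponse mirrors the JSON object described above. The field
// names "body" and "meta" are assumptions based on that description.
type convertResponse struct {
	Body string            `json:"body"`
	Meta map[string]string `json:"meta"`
}

// decodeResponse parses a JSON reply from the service.
func decodeResponse(data string) (convertResponse, error) {
	var res convertResponse
	err := json.Unmarshal([]byte(data), &res)
	return res, err
}

func main() {
	// Made-up sample reply, for illustration only.
	sample := `{"body":"Hello world","meta":{"Author":"Jane"}}`
	res, err := decodeResponse(sample)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(res.Body)
	fmt.Println(res.Meta["Author"])
}
```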
- addr: the bind address for the HTTP server, default is ":8888"
- readability-length-low: sets the readability length low if the ?readability=1 parameter is set
- readability-length-high: sets the readability length high if the ?readability=1 parameter is set
- readability-stopwords-low: sets the readability stopwords low if the ?readability=1 parameter is set
- readability-stopwords-high: sets the readability stopwords high if the ?readability=1 parameter is set
- readability-max-link-density: sets the readability max link density if the ?readability=1 parameter is set
- readability-max-heading-distance: sets the readability max heading distance if the ?readability=1 parameter is set
- readability-use-classes: comma separated list of readability classes to use if the ?readability=1 parameter is set
# This runs on port 8000
$ docd -addr :8000
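The readability flags only take effect when a request sets the ?readability=1 parameter. A minimal sketch of building such a request URL follows; `convertURL` is a hypothetical helper, the /convert path is taken from the curl example in this README, and passing readability as a query parameter on that path is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
)

// convertURL builds the request URL for a docd server listening on host,
// for example "localhost:8000". Adding readability=1 as a query parameter
// on /convert is an assumption.
func convertURL(host string, readability bool) string {
	u := url.URL{Scheme: "http", Host: host, Path: "/convert"}
	if readability {
		q := url.Values{}
		q.Set("readability", "1")
		u.RawQuery = q.Encode()
	}
	return u.String()
}

func main() {
	fmt.Println(convertURL("localhost:8000", true))
	// prints "http://localhost:8000/convert?readability=1"
}
```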
Some basic code is shown below, but normally you would accept the file by HTTP or open it from the file system.
This should be enough to get you started though.
Note: this assumes you have the dependencies installed.
// Convert a local file using the library directly.
package main

import (
	"fmt"

	"code.sajari.com/docconv/v2"
)

func main() {
	res, err := docconv.ConvertPath("your-file.pdf")
	if err != nil {
		// TODO: handle
	}
	fmt.Println(res)
}

// Convert a file by sending it to a running docd service.
package main

import (
	"fmt"

	"code.sajari.com/docconv/v2/client"
)

func main() {
	// Create a new client, using the default endpoint (localhost:8888)
	c := client.New()

	res, err := client.ConvertPath(c, "your-file.pdf")
	if err != nil {
		// TODO: handle
	}
	fmt.Println(res)
}
Alternatively, via curl:
$ curl -s -F input=@your-file.pdf http://localhost:8888/convert
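For callers that prefer not to use the client package, the same upload can be sketched with the Go standard library alone. This is an illustration under assumptions, not the project's own code: `newUploadRequest` is a hypothetical helper, the form field name `input` and the endpoint are copied from the curl command above, and the file contents are a placeholder.

```go
package main

import (
	"bytes"
	"fmt"
	"mime/multipart"
	"net/http"
)

// newUploadRequest builds a multipart POST carrying data under the form
// field "input", the field name used by the curl command above.
func newUploadRequest(endpoint, filename string, data []byte) (*http.Request, error) {
	var buf bytes.Buffer
	w := multipart.NewWriter(&buf)
	part, err := w.CreateFormFile("input", filename)
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(data); err != nil {
		return nil, err
	}
	// Close writes the trailing boundary; the body is incomplete without it.
	if err := w.Close(); err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, endpoint, &buf)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", w.FormDataContentType())
	return req, nil
}

func main() {
	// Placeholder bytes stand in for a real file read from disk.
	req, err := newUploadRequest("http://localhost:8888/convert", "your-file.pdf", []byte("placeholder"))
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
	// To send it to a running docd service:
	//   resp, err := http.DefaultClient.Do(req)
}
```

Building the request separately from sending it keeps the example runnable without a server; in real use, remember to close the response body after reading it.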