tabula-java

tabula-java is a library for extracting tables from PDF files. It is the table extraction engine that powers Tabula (repo). You can use tabula-java as a command-line tool to programmatically extract tables from PDFs.

Download

Download a version of the tabula-java jar, with all dependencies included, that works on Mac, Windows and Linux from our releases page.

Commandline Usage Examples

tabula-java provides a command line application:

$ java -jar target/tabula-1.0.5-jar-with-dependencies.jar --help
usage: tabula [-a <AREA>] [-b <DIRECTORY>] [-c <COLUMNS>] [-f <FORMAT>]
       [-g] [-h] [-i] [-l] [-n] [-o <OUTFILE>] [-p <PAGES>] [-r] [-s
       <PASSWORD>] [-t] [-u] [-v]

Tabula helps you extract tables from PDFs

 -a,--area <AREA>           -a/--area = Portion of the page to analyze.
                            Example: --area 269.875,12.75,790.5,561.
                            Accepts top,left,bottom,right i.e. y1,x1,y2,x2
                            where all values are in points relative to the
                            top left corner. If all values are between
                            0-100 (inclusive) and preceded by '%', input
                            will be taken as % of actual height or width
                            of the page. Example: --area %0,0,100,50. To
                            specify multiple areas, -a option should be
                            repeated. Default is entire page
 -b,--batch <DIRECTORY>     Convert all .pdfs in the provided directory.
 -c,--columns <COLUMNS>     X coordinates of column boundaries. Example
                            --columns 10.1,20.2,30.3. If all values are
                            between 0-100 (inclusive) and preceded by '%',
                            input will be taken as % of actual width of
                            the page. Example: --columns %25,50,80.6
 -f,--format <FORMAT>       Output format: (CSV,TSV,JSON). Default: CSV
 -g,--guess                 Guess the portion of the page to analyze per
                            page.
 -h,--help                  Print this help text.
 -i,--silent                Suppress all stderr output.
 -l,--lattice               Force PDF to be extracted using lattice-mode
                            extraction (if there are ruling lines
                            separating each cell, as in a PDF of an Excel
                            spreadsheet)
 -n,--no-spreadsheet        [Deprecated in favor of -t/--stream] Force PDF
                            not to be extracted using spreadsheet-style
                            extraction (if there are no ruling lines
                            separating each cell)
 -o,--outfile <OUTFILE>     Write output to <file> instead of STDOUT.
                            Default: -
 -p,--pages <PAGES>         Comma separated list of ranges, or all.
                            Examples: --pages 1-3,5-7, --pages 3 or
                            --pages all. Default is --pages 1
 -r,--spreadsheet           [Deprecated in favor of -l/--lattice] Force
                            PDF to be extracted using spreadsheet-style
                            extraction (if there are ruling lines
                            separating each cell, as in a PDF of an Excel
                            spreadsheet)
 -s,--password <PASSWORD>   Password to decrypt document. Default is empty
 -t,--stream                Force PDF to be extracted using stream-mode
                            extraction (if there are no ruling lines
                            separating each cell)
 -u,--use-line-returns      Use embedded line returns in cells. (Only in
                            spreadsheet mode.)
 -v,--version               Print version and exit.

It also includes a debugging tool; run java -cp ./target/tabula-1.0.5-jar-with-dependencies.jar technology.tabula.debug.Debug -h for the available options.
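The --area format described in the help text above (top,left,bottom,right, optionally prefixed with '%') can be easier to follow with a worked conversion. The following is a minimal sketch, not tabula-java's own parser: the class AreaSpec and the method toPoints are hypothetical names, and rejecting out-of-range percentages with an exception is an assumption about reasonable behaviour.

```java
import java.util.Arrays;

// Hypothetical helper illustrating the --area format; not part of tabula-java.
final class AreaSpec {

    /**
     * Converts an --area style string into {top, left, bottom, right} in points.
     * A leading '%' means the four values are percentages of the page size.
     */
    static double[] toPoints(String spec, double pageWidth, double pageHeight) {
        boolean percent = spec.startsWith("%");
        String[] parts = (percent ? spec.substring(1) : spec).split(",");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected top,left,bottom,right");
        }
        double[] v = new double[4];
        for (int i = 0; i < 4; i++) {
            v[i] = Double.parseDouble(parts[i].trim());
        }
        if (percent) {
            for (double x : v) {
                // Assumption: percentages outside 0-100 are treated as an error here.
                if (x < 0 || x > 100) {
                    throw new IllegalArgumentException("percentage out of range: " + x);
                }
            }
            v[0] = v[0] / 100 * pageHeight; // top is a fraction of the page height
            v[1] = v[1] / 100 * pageWidth;  // left is a fraction of the page width
            v[2] = v[2] / 100 * pageHeight; // bottom
            v[3] = v[3] / 100 * pageWidth;  // right
        }
        return v;
    }

    public static void main(String[] args) {
        // A US Letter page is 612 x 792 points.
        System.out.println(Arrays.toString(toPoints("%0,0,100,50", 612, 792)));
        // prints [0.0, 0.0, 792.0, 306.0]
    }
}
```

On a US Letter page, --area %0,0,100,50 therefore covers the full height and the left half of the page.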

You can also integrate tabula-java with any JVM language. For Java examples, see the tests folder.

JVM start-up time is a lot of the cost of the tabula command, so if you're trying to extract many tables from PDFs, you have a few options for speeding it up:

the -b option, which allows you to convert all pdfs in a given directory
the drip utility
the Ruby, Python, R, and Node.js bindings
writing your own program in any JVM language (Java, JRuby, Scala) that imports tabula-java.
waiting for us to implement an API/server-style system (it's on the roadmap)
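The "write your own program" option above pays the JVM start-up cost once and then loops over many PDFs inside the same process. The sketch below shows only that loop, under stated assumptions: BatchRunner, listPdfs and processAll are hypothetical names, and the per-file extraction is passed in as a Consumer so that the example runs without tabula-java on the classpath. In a real program, the consumer would call into tabula-java.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical driver: one JVM, many PDFs. Not part of tabula-java.
final class BatchRunner {

    /** Lists the regular files in dir whose names end in .pdf (case-insensitive), sorted. */
    static List<Path> listPdfs(Path dir) {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString()
                            .toLowerCase(Locale.ROOT).endsWith(".pdf"))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Applies the extractor to every PDF in dir and returns how many were visited. */
    static int processAll(Path dir, Consumer<Path> extractor) {
        List<Path> pdfs = listPdfs(dir);
        pdfs.forEach(extractor);
        return pdfs.size();
    }

    public static void main(String[] args) {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        // Placeholder extractor: replace with code that calls tabula-java.
        int visited = processAll(dir, pdf -> System.out.println("would extract tables from " + pdf));
        System.out.println(visited + " PDF(s) visited");
    }
}
```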

API Usage Examples

A simple Java code example which extracts all rows and cells from all tables of all pages of a PDF document:

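// Imports needed by the snippet below, which belongs inside an instance method
import java.io.InputStream;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import technology.tabula.ObjectExtractor;
import technology.tabula.Page;
import technology.tabula.PageIterator;
import technology.tabula.RectangularTextContainer;
import technology.tabula.Table;
import technology.tabula.extractors.SpreadsheetExtractionAlgorithm;

InputStream in = this.getClass().getResourceAsStream("my.pdf");
try (PDDocument document = PDDocument.load(in)) {
    SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();
    PageIterator pi = new ObjectExtractor(document).extract();
    while (pi.hasNext()) {
        // iterate over the pages of the document
        Page page = pi.next();
        List<Table> tables = sea.extract(page);
        // iterate over the tables of the page
        for (Table table : tables) {
            List<List<RectangularTextContainer>> rows = table.getRows();
            // iterate over the rows of the table
            for (List<RectangularTextContainer> cells : rows) {
                // print all column-cells of the row plus linefeed
                for (RectangularTextContainer content : cells) {
                    // Note: Cell.getText() uses \r to concat text chunks
                    String text = content.getText().replace("\r", " ");
                    System.out.print(text + "|");
                }
                System.out.println();
            }
        }
    }
}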
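To check the output format of the example above without a PDF at hand, the per-row formatting can be reproduced on plain strings. This is a sketch under that assumption only: RowFormatter and formatRow are hypothetical names, and the strings stand in for the values that getText() would return.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper mirroring the printing loop above on plain strings.
final class RowFormatter {

    /** Replaces \r inside each cell with a space and appends '|' after every cell. */
    static String formatRow(List<String> cells) {
        StringBuilder sb = new StringBuilder();
        for (String cell : cells) {
            sb.append(cell.replace("\r", " ")).append('|');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> row = Arrays.asList("Total\rrevenue", "1,234", "");
        System.out.println(formatRow(row)); // prints "Total revenue|1,234||"
    }
}
```

Every cell, including the last one and any empty ones, is followed by a '|', so a consumer splitting on that character should expect a trailing empty field.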

For more detailed information, check the Javadoc. The Javadoc API documentation can be generated (see also the 'Building from Source' section) via

mvn javadoc:javadoc

which generates the HTML files in the directory target/site/apidocs/

Building from Source

Clone this repo and run:

mvn clean compile assembly:single

Contributing

Interested in helping out? We'd love to have your help!

You can help by:

Reporting a bug.
Adding or editing documentation.
Contributing code via a Pull Request.
Spreading the word about tabula-java to people who might be able to benefit from using it.

Backers

You can also support our continued work on tabula-java with a one-time or monthly donation on OpenCollective. Organizations who use tabula-java can also sponsor the project for acknowledgement on our official site and this README.

Special thanks to the following users and organizations for generously supporting Tabula with donations and grants: