Extract tables from PDF files
tabulapdf/tabula-java
tabula-java is a library for extracting tables from PDF files. It is the table extraction engine that powers Tabula (repo). You can use tabula-java as a command-line tool to programmatically extract tables from PDFs.
© 2014-2020 Manuel Aristarán. Available under MIT License. See LICENSE.
Download a version of the tabula-java jar, with all dependencies included, that works on Mac, Windows and Linux from our releases page.
tabula-java provides a command line application:

    $ java -jar target/tabula-1.0.5-jar-with-dependencies.jar --help
    usage: tabula [-a <AREA>] [-b <DIRECTORY>] [-c <COLUMNS>] [-f <FORMAT>]
           [-g] [-h] [-i] [-l] [-n] [-o <OUTFILE>] [-p <PAGES>] [-r]
           [-s <PASSWORD>] [-t] [-u] [-v]

    Tabula helps you extract tables from PDFs

     -a,--area <AREA>           -a/--area = Portion of the page to analyze.
                                Example: --area 269.875,12.75,790.5,561.
                                Accepts top,left,bottom,right i.e. y1,x1,y2,x2
                                where all values are in points relative to the
                                top left corner. If all values are between
                                0-100 (inclusive) and preceded by '%', input
                                will be taken as % of actual height or width
                                of the page. Example: --area %0,0,100,50. To
                                specify multiple areas, -a option should be
                                repeated. Default is entire page
     -b,--batch <DIRECTORY>     Convert all .pdfs in the provided directory.
     -c,--columns <COLUMNS>     X coordinates of column boundaries. Example
                                --columns 10.1,20.2,30.3. If all values are
                                between 0-100 (inclusive) and preceded by '%',
                                input will be taken as % of actual width of
                                the page. Example: --columns %25,50,80.6
     -f,--format <FORMAT>       Output format: (CSV,TSV,JSON). Default: CSV
     -g,--guess                 Guess the portion of the page to analyze per
                                page.
     -h,--help                  Print this help text.
     -i,--silent                Suppress all stderr output.
     -l,--lattice               Force PDF to be extracted using lattice-mode
                                extraction (if there are ruling lines
                                separating each cell, as in a PDF of an Excel
                                spreadsheet)
     -n,--no-spreadsheet        [Deprecated in favor of -t/--stream] Force PDF
                                not to be extracted using spreadsheet-style
                                extraction (if there are no ruling lines
                                separating each cell)
     -o,--outfile <OUTFILE>     Write output to <file> instead of STDOUT.
                                Default: -
     -p,--pages <PAGES>         Comma separated list of ranges, or all.
                                Examples: --pages 1-3,5-7, --pages 3 or
                                --pages all. Default is --pages 1
     -r,--spreadsheet           [Deprecated in favor of -l/--lattice] Force
                                PDF to be extracted using spreadsheet-style
                                extraction (if there are ruling lines
                                separating each cell, as in a PDF of an Excel
                                spreadsheet)
     -s,--password <PASSWORD>   Password to decrypt document. Default is empty
     -t,--stream                Force PDF to be extracted using stream-mode
                                extraction (if there are no ruling lines
                                separating each cell)
     -u,--use-line-returns      Use embedded line returns in cells. (Only in
                                spreadsheet mode.)
     -v,--version               Print version and exit.

It also includes a debugging tool; run `java -cp ./target/tabula-1.0.5-jar-with-dependencies.jar technology.tabula.debug.Debug -h` for the available options.
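The percentage form of `-a/--area` can be easier to reason about with the arithmetic written out. The following is a minimal sketch of the conversion the help text describes (top and bottom scale with page height, left and right with page width). It is not tabula-java's own code: the class name `AreaPercentSketch`, the method `toPoints`, and the US Letter page size are assumptions made for illustration.

```java
import java.util.Arrays;

// Illustrative only: mirrors the arithmetic described in the help text,
// not tabula-java's internal implementation.
final class AreaPercentSketch {

    /**
     * Converts a percentage area given as {top, left, bottom, right}
     * into points for a page of the given size.
     */
    static double[] toPoints(double[] percentArea, double pageWidth, double pageHeight) {
        if (percentArea.length != 4) {
            throw new IllegalArgumentException("expected top,left,bottom,right");
        }
        for (double v : percentArea) {
            if (v < 0 || v > 100) {
                throw new IllegalArgumentException("percent values must be in 0-100: " + v);
            }
        }
        return new double[] {
            percentArea[0] / 100.0 * pageHeight, // top
            percentArea[1] / 100.0 * pageWidth,  // left
            percentArea[2] / 100.0 * pageHeight, // bottom
            percentArea[3] / 100.0 * pageWidth   // right
        };
    }

    public static void main(String[] args) {
        // Assumed page size: US Letter, 612 x 792 points.
        double[] area = toPoints(new double[] {0, 0, 100, 50}, 612, 792);
        System.out.println(Arrays.toString(area)); // prints [0.0, 0.0, 792.0, 306.0]
    }
}
```

Under these assumptions, `--area %0,0,100,50` selects the left half of the page at full height.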
You can also integrate tabula-java with any JVM language. For Java examples, see the tests folder.
JVM start-up time accounts for much of the running time of the tabula command, so if you're trying to extract many tables from PDFs, you have a few options for speeding it up:
- the -b option, which allows you to convert all pdfs in a given directory
- the drip utility
- the Ruby, Python, R, and Node.js bindings
- writing your own program in any JVM language (Java, JRuby, Scala) that imports tabula-java.
- waiting for us to implement an API/server-style system (it's on the roadmap)
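As a rough illustration of the first option, the sketch below builds a single batch-mode command line so that the JVM starts once for a whole directory rather than once per PDF. It uses only flags listed in the help text (`-b`, `-f`, `-l`, `-t`). The class name `BatchCommandSketch`, the helper `batchCommand`, and the `pdfs` directory are hypothetical, and combining `-b` with the format and mode flags is an assumption worth checking against your tabula-java version.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: assembles a tabula command line for batch mode.
final class BatchCommandSketch {

    /** Builds the argument list for one batch run over a directory of PDFs. */
    static List<String> batchCommand(String jarPath, String pdfDirectory,
                                     String format, boolean lattice) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add("-b");                  // convert all .pdfs in the directory
        cmd.add(pdfDirectory);
        cmd.add("-f");                  // output format: CSV, TSV or JSON
        cmd.add(format);
        cmd.add(lattice ? "-l" : "-t"); // lattice mode or stream mode
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = batchCommand(
            "target/tabula-1.0.5-jar-with-dependencies.jar", "pdfs", "CSV", true);
        System.out.println(String.join(" ", cmd));
        // To actually launch it (requires the jar and the directory to exist):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

The command is only printed here; the `ProcessBuilder` call is left commented out so the sketch runs without the jar present.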
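A simple Java code example which extracts all rows and cells from all tables of all pages of a PDF document:

    import java.io.InputStream;
    import java.util.List;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import technology.tabula.ObjectExtractor;
    import technology.tabula.Page;
    import technology.tabula.PageIterator;
    import technology.tabula.RectangularTextContainer;
    import technology.tabula.Table;
    import technology.tabula.extractors.SpreadsheetExtractionAlgorithm;

    // Fragment: place inside an instance method that may throw IOException.
    InputStream in = this.getClass().getResourceAsStream("my.pdf");
    try (PDDocument document = PDDocument.load(in)) {
        SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();
        PageIterator pi = new ObjectExtractor(document).extract();
        while (pi.hasNext()) {
            // iterate over the pages of the document
            Page page = pi.next();
            List<Table> tables = sea.extract(page);
            // iterate over the tables of the page
            for (Table table : tables) {
                List<List<RectangularTextContainer>> rows = table.getRows();
                // iterate over the rows of the table
                for (List<RectangularTextContainer> cells : rows) {
                    // print all column-cells of the row plus linefeed
                    for (RectangularTextContainer content : cells) {
                        // Note: Cell.getText() uses \r to concat text chunks
                        String text = content.getText().replace("\r", " ");
                        System.out.print(text + "|");
                    }
                    System.out.println();
                }
            }
        }
    }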
For more detailed information, check the Javadoc. The Javadoc API documentation can be generated (see also the 'Building from Source' section) via

    mvn javadoc:javadoc

which generates the HTML files in the directory `target/site/apidocs/`.
Clone this repo and run:

    mvn clean compile assembly:single

Interested in helping out? We'd love to have your help!
You can help by:
- Reporting a bug.
- Adding or editing documentation.
- Contributing code via a Pull Request.
- Spreading the word about tabula-java to people who might be able to benefit from using it.
You can also support our continued work on tabula-java with a one-time or monthly donation on OpenCollective. Organizations who use tabula-java can also sponsor the project for acknowledgement on our official site and this README.
Special thanks to the following users and organizations for generously supporting Tabula with donations and grants:
