Tabula is a tool for liberating data tables trapped inside PDF files
tabulapdf/tabula
Is tabula an active project?
Tabula is, and always has been, a volunteer-run project. We've occasionally had funding for specific features, but it's never been a commercial undertaking. At the moment, none of the original authors have the time to actively work on the project. The end-user application, hosted on this repo, is unlikely to see updates from us in the near future. tabula-java sees updates and occasional bug-fix releases.
--
Repo Note: The master branch is an in-development version of Tabula. This may be substantially different from the latest releases of Tabula.
Tabula helps you liberate data tables trapped inside PDF files.
- Download from the official site
- Read more about Tabula on OpenNews Source
- Interested in using Tabula on the command-line? Check out tabula-java, a Java library and command-line interface for Tabula. (This is the extraction library that powers Tabula.)
© 2012-2020 Manuel Aristarán. Available under MIT License. See AUTHORS.md and LICENSE.md.
- Why Tabula?
- Using Tabula
- Known issues
- Incorporating Tabula into your own project
- Running Tabula from source (for developers)
- Contributing
Why Tabula?
If you’ve ever tried to do anything with data provided to you in PDFs, you know how painful it is: you can’t easily copy-and-paste rows of data out of PDF files. Tabula allows you to extract that data in CSV format, through a simple web interface.
Caveat: Tabula only works on text-based PDFs, not scanned documents. If you can click-and-drag to select text in your table in a PDF viewer (even if the output is disorganized trash), then your PDF is text-based and Tabula should work.
Security Concerns?: Tabula is designed with security in mind. Your PDF and the extracted data never touch the net -- when you use Tabula on your local machine, as long as your browser's URL bar says "localhost" or "127.0.0.1", all processing takes place on your local machine. Other than to retrieve a few badges and other static assets, there are two calls that are made from your browser to external machines: one fetches the list of latest Tabula versions from GitHub to alert you if Tabula has been updated; the other makes a call to a stats counter that helps us determine how often various versions of Tabula are being used. If this is a problem, the version check can be disabled by adding -Dtabula.disable_version_check=1 to the command line at startup, and the stats counter call can be disabled by adding -Dtabula.disable_notifications=1. Please note: If you are providing Tabula as a service using a reverse SSL proxy, users may notice a security warning due to our stats counter endpoint being hosted at a non-secure URL, so you may wish to disable the notifications in this scenario.
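Both switches can be combined in a single launch. The following is a minimal sketch, assuming Tabula is started from the tabula-jar download with tabula.jar in the current directory. The function names tabula_privacy_opts and start_tabula_private are hypothetical helpers; only the two -D flags come from the paragraph above.

```shell
# Hypothetical helpers; only the two -D system properties are Tabula's own.
tabula_privacy_opts() {
  # Disables the GitHub version check and the stats counter call.
  printf '%s %s\n' \
    "-Dtabula.disable_version_check=1" \
    "-Dtabula.disable_notifications=1"
}

start_tabula_private() {
  # Assumes tabula.jar is in the current directory.
  # The substitution is deliberately unquoted so each flag becomes its own argument.
  java -Dfile.encoding=utf-8 -Xms256M -Xmx1024M $(tabula_privacy_opts) -jar tabula.jar
}
```

Calling start_tabula_private from the unzipped tabula directory should start the server as usual, with both calls disabled.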
Using Tabula
First, make sure you have a recent copy of Java installed. You can download Java here. Tabula requires a Java Runtime Environment compatible with Java 7 (i.e. Java 7, 8 or higher). If you have a problem, check Known Issues first, then report an issue.
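To check which Java a machine has before launching, the major version can be read from the output of java -version. This is a rough sketch, not part of Tabula: java_major and installed_java_major are hypothetical helper names, and the parsing assumes the usual version strings such as 1.8.0_292 (Java 8) or 17.0.2 (Java 17).

```shell
# Hypothetical helpers, not shipped with Tabula.
java_major() {
  # Reduce a Java version string to its major version.
  # Old scheme: "1.8.0_292" means Java 8; new scheme: "17.0.2" means Java 17.
  v="$1"
  case "$v" in
    1.*) v="${v#1.}" ;;
  esac
  # Keep only the leading digits.
  printf '%s\n' "${v%%[!0-9]*}"
}

installed_java_major() {
  # `java -version` prints to stderr, e.g.: openjdk version "17.0.2" 2022-01-18
  java_major "$(java -version 2>&1 | sed -n 's/.* version "\([^"]*\)".*/\1/p' | head -n 1)"
}
```

With Java on the PATH, a test such as [ "$(installed_java_major)" -ge 7 ] succeeds when the installed runtime meets the Java 7 requirement.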
On Windows, download tabula-win.zip from the download site. Unzip the whole thing and open the tabula.exe file inside. A browser should automatically open to http://127.0.0.1:8080/ . If not, open your web browser of choice and visit that link. To close Tabula, just go back to the console window and press "Control-C" (as if to copy).
On a Mac, download tabula-mac.zip from the download site. Unzip and open the Tabula app inside. A browser should automatically open to http://127.0.0.1:8080/ . If not, open your web browser of choice and visit that link. To close Tabula, find the Tabula icon in your dock, right-click (or control-click) on it, and press "Quit".
Note: If you’re running Mac OS X 10.8 or later, you might get an error like "Tabula is damaged and can't be opened." We're working on fixing this; in the meantime, see the workaround under Known issues.
On Linux, Tabula is available as a snap package. If you have snap on your system, you can install it with:
sudo snap install tabula
On other platforms, download tabula-jar.zip from the download site and unzip it to the directory of your choice. Open a terminal window, and cd to inside the tabula directory you just unzipped. Then run:
java -Dfile.encoding=utf-8 -Xms256M -Xmx1024M -jar tabula.jar
Then manually navigate your browser to http://127.0.0.1:8080/ . (New in Tabula 1.1. To go back to the old behavior that automatically launches your web browser, use the -Dtabula.openBrowser=true option.)
Tabula binds to port 8080 by default. You can change it with the warbler.port option; for example, to use port 9999:
java -Dfile.encoding=utf-8 -Xms256M -Xmx1024M -Dwarbler.port=9999 -jar tabula.jar
Docker Compose quick start using an Amazon Corretto image
Make a new directory, e.g. tabulapdf, and enter it:
mkdir -p /opt/docker/tabulapdf
cd /opt/docker/tabulapdf
Download the tabula-jar package, for example version 1.2.1:
wget https://github.com/tabulapdf/tabula/releases/download/v1.2.1/tabula-jar-1.2.1.zip
Verify the checksum (compare the output with the release page):
sha256sum tabula-jar-1.2.1.zip
Unzip it:
unzip tabula-jar-1.2.1.zip
Place or create a docker-compose.yml file and adjust it as needed:
### tabulapdf docker-compose.yml example ###
services:
  tabulapdf:
    image: amazoncorretto:17
    container_name: tabulapdf-app
    command: >
      java -Dfile.encoding=utf-8 -Xms256M -Xmx1024M -Dwarbler.port=8080 -Dtabula.openBrowser=false -jar /app/tabula.jar
    volumes:
      - ./tabula:/app
    ports:
      - "8080:8080"
Run the app with:
docker compose up -d
The app will be exposed on port 8080 and can easily be paired with a reverse proxy, e.g. Traefik.
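The checksum comparison in the steps above can be scripted rather than done by eye. A minimal sketch, assuming GNU sha256sum is available: verify_sha256 is a hypothetical helper, and the expected digest (lowercase hex) has to be copied from the release page, since none is given here.

```shell
# Hypothetical helper: succeeds only if FILE has the expected SHA-256 digest.
verify_sha256() {
  file="$1"
  expected="$2"
  # sha256sum prints "<digest>  <filename>"; keep only the digest.
  actual="$(sha256sum "$file" | cut -d ' ' -f 1)"
  [ "$actual" = "$expected" ]
}

# Usage (replace the placeholder with the digest shown on the release page):
# verify_sha256 tabula-jar-1.2.1.zip "<sha256 from release page>" && unzip tabula-jar-1.2.1.zip
```

Chaining the unzip step with && means the archive is only unpacked when the digests match.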
If the program fails to run, double-check that you have Java installed and then try again.
Known issues
There are some bugs that we're aware of that we haven't managed to fix yet. If there's not a solution here or you need more help, please go ahead and report an issue.
Legacy Java Environment (SE 6) Is Required (Mac): The Mac operating system recently changed how it packages the Java Runtime Environment. If you get this error, download Tabula's "large experimental" package. This package includes its own Java Runtime Environment and should work without this issue.
"Tabula is damaged and can't be opened" (Mac): If you’re running Mac OS X 10.8 or later, GateKeeper may prevent you from opening the Tabula app. Please see this GateKeeper page for more information.
- Right-click on Tabula.app and select Open from the context menu.
- The system will tell you that the application is "from an unidentified developer" and ask you whether you want to open it. Click Open to allow the application to run. The system remembers this choice and won't prompt you again.
(If you continue to have issues, double-check the OS X GateKeeper documentation for more information.)
org.jruby.exceptions.RaiseException: (Encoding::CompatibilityError) incompatible character encodings (Windows): Your Windows computer expects a type of encoding other than Unicode or Windows's English encoding. You can fix this by entering a few simple commands in the Command Prompt. (The commands won't affect anything besides Tabula.)
- Open a Command Prompt.
- Type cd and then the path to the directory that contains tabula.exe, e.g. cd C:\Users\Username\Downloads
- Change that terminal's codepage to Unicode by typing: chcp 65001
- Run Tabula by typing: tabula.exe
A browser tab opens, but something other than Tabula loads there. Or Tabula doesn't start. It's possible another program is using port 8080, which Tabula binds to by default. You can try closing the other program, or change the port Tabula uses by running Tabula from the terminal with the warbler.port property:
java -Dfile.encoding=utf-8 -Xms256M -Xmx1024M -Dwarbler.port=9999 -jar tabula.jar
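Before picking a new port, it can help to confirm that 8080 is actually taken. The sketch below is an illustration under stated assumptions, not part of Tabula: port_in_use and first_free_port are hypothetical names, and the check relies on bash's /dev/tcp redirection, so it needs to run in bash (in other shells the probe fails and every port is reported as free).

```shell
# Hypothetical helpers; require bash for the /dev/tcp redirection.
port_in_use() {
  # Succeeds if something accepts TCP connections on 127.0.0.1:<port>.
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

first_free_port() {
  # Print the first port at or above the starting port (default 8080)
  # that nothing is listening on (prints 65536 if none is free).
  port="${1:-8080}"
  while [ "$port" -le 65535 ] && port_in_use "$port"; do
    port=$((port + 1))
  done
  printf '%s\n' "$port"
}
```

Passing -Dwarbler.port="$(first_free_port 8080)" on the java command line then starts Tabula on whichever port was found.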
Incorporating Tabula into your own project
Tabula is open-source, so we'd love for you to incorporate pieces of Tabula into your own projects. The "guts" of Tabula -- that is, the logic and heuristics that reconstruct tables from PDFs -- is contained in the tabula-java repo. There's a JAR file that you can easily incorporate into JVM languages like Java, Scala or Clojure, and it includes a command-line tool for you to automate your extraction tasks. Visit that repo for more information on how to use tabula-java on the CLI and on how Tabula exports tabula-java scripts.
Bindings for tabula-java exist for R, Node.js, Python and (deprecated) JRuby. If you end up writing bindings for another language, let us know and we'll add a link here.
- tabulizer provides R bindings for tabula-java and is community-supported by @leeper.
- tabula-js provides Node.js bindings for tabula-java; it is community-supported by @ezodude.
- tabula-py provides Python bindings for tabula-java; it is community-supported by @chezou.
- tabula-extractor (DEPRECATED) provides JRuby bindings for tabula-java.
Running Tabula from source (for developers)
Download JRuby. You can install it from its website, or using tools like rvm or rbenv. Note that as of Tabula 1.1.0 (7875582becb2799b65586d5680782cafd399bb33), Tabula uses the JRuby 9000 series (i.e. JRuby 9.1.5.0).
Download Tabula and install the Ruby dependencies. (Note: if using rvm or rbenv, ensure that JRuby is being used.)
git clone https://github.com/tabulapdf/tabula.git
cd tabula
gem install bundler -v 1.17.3
bundle install
jruby -S jbundle install
Then, start the development server:
jruby -G -r jbundler -S rackup
(If you get encoding errors, set the JAVA_OPTS environment variable to -Dfile.encoding=utf-8.)
The site instance should now be viewable at http://127.0.0.1:9292/ .
You can set a couple of options when running the server in this manner:
TABULA_DATA_DIR="/tmp/tabula" \
TABULA_DEBUG=1 \
jruby -G -r jbundler -S rackup
TABULA_DATA_DIR controls where uploaded data for Tabula is stored. By default, data is stored in the OS-dependent application data directory for the current user (similar to C:\Users\foo\AppData\Roaming\Tabula on Windows, ~/Library/Application Support/Tabula on Mac, ~/.tabula on Linux/UNIX).
TABULA_DEBUG prints out extra status data when PDF files are being processed (false by default).
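The lookup order described above can be mirrored in a small shell function, for instance to locate uploaded files from a script. This is an illustrative sketch only: tabula_data_dir is a hypothetical helper rather than Tabula's own code, it covers just the Mac and Linux/UNIX defaults, and those defaults are described above only as approximate.

```shell
# Hypothetical helper mirroring the documented defaults (not Tabula's own logic).
tabula_data_dir() {
  # An explicit TABULA_DATA_DIR always wins.
  if [ -n "${TABULA_DATA_DIR:-}" ]; then
    printf '%s\n' "$TABULA_DATA_DIR"
    return 0
  fi
  case "$(uname -s)" in
    Darwin) printf '%s\n' "$HOME/Library/Application Support/Tabula" ;;  # Mac
    *)      printf '%s\n' "$HOME/.tabula" ;;                              # Linux/UNIX
  esac
}
```

For example, ls "$(tabula_data_dir)" lists whatever the development server has stored so far, assuming the directory exists.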
Alternatively, running the server as a JAR file
Testing in this manner will be closer to testing the "packaged application" version of the app.
jruby -G -S rake war
java -Dfile.encoding=utf-8 -Xms256M -Xmx1024M -jar build/tabula.jar
If you intend to develop against an unreleased version of tabula-java, you need to install its JAR to your local Maven repository. From the directory that contains the tabula-java source:
mvn install:install-file -Dfile=target/tabula-<version>-SNAPSHOT.jar -DgroupId=technology.tabula -DartifactId=tabula -Dversion=<version>-SNAPSHOT -Dpackaging=jar -DpomFile=pom.xml
Then, adjust the Jarfile accordingly.
After performing the above steps ("Running Tabula from source"), you can compile Tabula into a standalone application:
Mac OS X
If you wish to share Tabula with other machines, you will need a code-signing certificate. Our distribution of Tabula uses a self-signed certificate, as noted above. See this section of build.xml for details. If you will only be running Tabula on the machine you are building it on, you may remove this entire block (lines 44-53).
To compile the app:
WEBSERVER_VERSION=9.4.31.v20200723 MAVEN_REPO=https://repo1.maven.org/maven2 rake macosx
This will result in a portable "tabula_mac.zip" archive (inside the build directory) for Mac OS X users.
Note that the Mac version bundles Java with the Tabula app. This results in a 98MB zip file, versus the 30MB zip file for other platforms, but allows users to run Tabula without having to worry about Java version incompatibilities.
Windows
You can build .exe files for the Windows target on any platform.
Download a 3.1.X (beta) copy of Launch4J.
Unzip it into the Tabula repo so that "launch4j" (with subdirectories "bin", etc.) is in the repository root.
(If you're building on 64-bit Linux, you may need to install 32-bit libraries; on Ubuntu, for example: sudo apt-get install lib32z1 lib32ncurses5)
Then:
WEBSERVER_VERSION=9.4.31.v20200723 MAVEN_REPO=https://repo1.maven.org/maven2 rake windows
This will result in a portable "tabula_win.zip" archive (inside the build directory) for Windows users.
If you have issues, you can try building manually. (These commands are for OS X/Linux and may need to be adjusted for Windows users.)
# (from the root directory of the repo)
WEBSERVER_VERSION=9.4.31.v20200723 MAVEN_REPO=https://repo1.maven.org/maven2 rake war
cd launch4j
ant -f ../build.xml windows
A "tabula.exe" file will be generated in "build/windows". To run, the exe file needs "tabula.jar" (contained in "build") in the same directory. You can create a .zip archive by doing:
# (from the root directory of the repo)
cd build/windows
mkdir tabula
cp tabula.exe ./tabula/
cp ../tabula.jar ./tabula/
zip -r9 tabula_win.zip tabula
rm -fr tabula
Contributing
Interested in helping out? We'd love to have your help!
You can help by:
- Reporting a bug.
- Adding or editing documentation.
- Contributing code via a Pull Request from ideas or bugs listed in the Enhancements section of the issues; see CONTRIBUTING.md.
- Spreading the word about Tabula to people who might be able to benefit from using it.
You can also support our continued work on Tabula with a one-time or monthly donation on OpenCollective. Organizations who use Tabula can also sponsor the project for acknowledgement on our official site and this README.
Tabula is made possible in part through the generosity of our users and through grants from the Knight Foundation and the Shuttleworth Foundation. Special thanks to all the users and organizations that support Tabula!
More acknowledgments can be found in AUTHORS.md.

