JonathanLink/PDFLayoutTextStripper
Converts a PDF file into a text file while keeping the layout of the original PDF. Useful to extract the content from a table or a form in a PDF file. PDFLayoutTextStripper is a subclass of the PDFTextStripper class (from the Apache PDFBox library).
Data extraction from a form in a PDF file
    <dependency>
        <groupId>io.github.jonathanlink</groupId>
        <artifactId>PDFLayoutTextStripper</artifactId>
        <version>2.2.3</version>
    </dependency>
- Install Apache PDFBox manually (for example v2.0.6) and its two dependencies, commons-logging.jar and fontbox

Warning: only PDFBox versions from 2.0.0 upwards are compatible with this version of PDFLayoutTextStripper.java.
    cd PDFLayoutTextStripper
    javac -cp .:/pathto/pdfbox-2.0.6.jar:/pathto/commons-logging-1.2.jar:/pathto/PDFLayoutTextStripper/fontbox-2.0.6.jar *.java
    java -cp .:/pathto/pdfbox-2.0.6.jar:/pathto/commons-logging-1.2.jar:/pathto/PDFLayoutTextStripper/fontbox-2.0.6.jar test
The same as for Linux (see above), but replace : with ; in the classpath.
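The only difference between the two platforms is the classpath separator. As a small illustration (not part of this project), the sketch below joins classpath entries with `java.io.File.pathSeparator`, which is `:` on Linux/Mac and `;` on Windows. The class name `ClasspathBuilder` and its `join` helper are hypothetical; the jar paths are the placeholder paths from the commands above.

```java
import java.io.File;

class ClasspathBuilder {
    // Joins classpath entries with the given separator
    // (":" on Linux/Mac, ";" on Windows).
    static String join(String separator, String... entries) {
        return String.join(separator, entries);
    }

    public static void main(String[] args) {
        // File.pathSeparator resolves to the separator of the current platform.
        String classpath = join(File.pathSeparator,
                ".",
                "/pathto/pdfbox-2.0.6.jar",
                "/pathto/commons-logging-1.2.jar",
                "/pathto/PDFLayoutTextStripper/fontbox-2.0.6.jar");
        System.out.println(classpath);
    }
}
```

The printed string can be passed to `javac -cp` or `java -cp` on the platform where it was generated.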
    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.IOException;

    import org.apache.pdfbox.io.RandomAccessFile;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class test {
        public static void main(String[] args) {
            String string = null;
            try {
                // Parse the sample PDF and wrap the result in a PDDocument
                PDFParser pdfParser = new PDFParser(new RandomAccessFile(new File("./samples/bus.pdf"), "r"));
                pdfParser.parse();
                PDDocument pdDocument = new PDDocument(pdfParser.getDocument());
                // PDFLayoutTextStripper is used like any other PDFTextStripper
                PDFTextStripper pdfTextStripper = new PDFLayoutTextStripper();
                string = pdfTextStripper.getText(pdDocument);
                // Release the resources held by the document
                pdDocument.close();
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
            System.out.println(string);
        }
    }
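Because the extracted text keeps the original layout, table cells on a row end up separated by runs of spaces. The sketch below shows one possible way to split such a row into cells. It assumes that a gap of two or more spaces marks a column boundary, which is a heuristic and not something the library guarantees. The class name `LayoutColumns`, the method `splitRow`, and the sample row are hypothetical, and the code needs only the Java standard library.

```java
import java.util.ArrayList;
import java.util.List;

class LayoutColumns {
    // Splits one line of layout-preserved text into cells, treating a run of
    // two or more spaces as a column gap (a heuristic assumption).
    static List<String> splitRow(String line) {
        List<String> cells = new ArrayList<>();
        for (String cell : line.trim().split(" {2,}")) {
            if (!cell.isEmpty()) {
                cells.add(cell);
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        // Hypothetical row, shaped like a line of a timetable
        String row = "  Line 12     08:15   Central Station ";
        System.out.println(splitRow(row)); // prints [Line 12, 08:15, Central Station]
    }
}
```

Single spaces inside a cell (as in "Central Station") are preserved. Cells that are empty in the PDF leave only a wider gap, so this heuristic cannot recover them; splitting on fixed character offsets would be needed for that.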
Thanks to
- Dmytro Zelinskyy for reporting an issue along with its fix (v2.2.3)
- Ho Ting Cheng for reporting an issue (v2.1)
- James Sullivan for having updated the code to make it work with the latest version of PDFBox (v2.0)