Converts a pdf file into a text file while keeping the layout of the original pdf. Useful to extract the content from a table in a pdf file for instance. This is a subclass of PDFTextStripper class (from the Apache PDFBox library).
JonathanLink/PDFLayoutTextStripper
Converts a PDF file into a text file while keeping the layout of the original PDF. Useful to extract the content from a table or a form in a PDF file. PDFLayoutTextStripper is a subclass of the PDFTextStripper class (from the Apache PDFBox library).
Data extraction from a form in a PDF file
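To make "keeping the layout" concrete, here is a minimal, dependency-free sketch of the general idea: text fragments with page coordinates are placed onto a fixed character grid, so horizontal gaps in the PDF become runs of spaces in the output. This is an illustration only, not the actual PDFLayoutTextStripper implementation. The `Fragment` type, the `render` method, and the fixed cell sizes (`charWidth`, `lineHeight`) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LayoutGridSketch {

    /** Hypothetical type: a piece of text and the page position of its first character. */
    static final class Fragment {
        final double x;    // distance from the left edge, in points
        final double y;    // distance from the top edge, in points
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Places each fragment on a character grid.
     * charWidth and lineHeight are assumed, fixed cell sizes in points.
     */
    static String render(List<Fragment> fragments, double charWidth, double lineHeight) {
        List<StringBuilder> rows = new ArrayList<>();
        for (Fragment f : fragments) {
            // Map page coordinates to grid cells; negative positions are clamped to 0.
            int row = Math.max(0, (int) Math.floor(f.y / lineHeight));
            int col = Math.max(0, (int) Math.floor(f.x / charWidth));
            while (rows.size() <= row) {
                rows.add(new StringBuilder());
            }
            StringBuilder line = rows.get(row);
            // Pad with spaces so the fragment starts at its column.
            while (line.length() < col) {
                line.append(' ');
            }
            // Write the text, overwriting anything already in those cells.
            for (int i = 0; i < f.text.length(); i++) {
                int pos = col + i;
                if (pos < line.length()) {
                    line.setCharAt(pos, f.text.charAt(i));
                } else {
                    line.append(f.text.charAt(i));
                }
            }
        }
        StringBuilder out = new StringBuilder();
        for (StringBuilder line : rows) {
            out.append(line).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Fragment> fragments = Arrays.asList(
                new Fragment(0, 0, "Stop"),
                new Fragment(60, 0, "Time"),
                new Fragment(0, 12, "Main St"),
                new Fragment(60, 12, "08:15"));
        System.out.print(render(fragments, 6.0, 12.0));
    }
}
```

With a 6-point cell width, both fragments at x = 60 start at column 10, so the two columns of the made-up table line up in the text output. A real PDF has variable character widths and font sizes, which this sketch ignores.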
    <dependency>
        <groupId>io.github.jonathanlink</groupId>
        <artifactId>PDFLayoutTextStripper</artifactId>
        <version>2.2.3</version>
    </dependency>
- Install Apache PDFBox manually (v2.0.6, for example) together with its two dependencies, commons-logging.jar and fontbox.
Warning: only PDFBox versions 2.0.0 and later are compatible with this version of PDFLayoutTextStripper.java.
    cd PDFLayoutTextStripper
    javac -cp .:/pathto/pdfbox-2.0.6.jar:/pathto/commons-logging-1.2.jar:/pathto/PDFLayoutTextStripper/fontbox-2.0.6.jar *.java
    java -cp .:/pathto/pdfbox-2.0.6.jar:/pathto/commons-logging-1.2.jar:/pathto/PDFLayoutTextStripper/fontbox-2.0.6.jar test
The same as for Linux (see above) but replace : with ;
    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.IOException;

    import org.apache.pdfbox.io.RandomAccessFile;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class test {
        public static void main(String[] args) {
            String string = null;
            try {
                PDFParser pdfParser = new PDFParser(new RandomAccessFile(new File("./samples/bus.pdf"), "r"));
                pdfParser.parse();
                PDDocument pdDocument = new PDDocument(pdfParser.getDocument());
                PDFTextStripper pdfTextStripper = new PDFLayoutTextStripper();
                string = pdfTextStripper.getText(pdDocument);
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
            System.out.println(string);
        }
    }
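Once the layout-preserving text is available, one possible way to pull table cells out of it is to split each line on runs of two or more whitespace characters. The sketch below is a hypothetical post-processing helper, not part of PDFLayoutTextStripper. The `TableRowSplitter` class, its `splitRows` method, and the two-space threshold are all assumptions, and the sample string stands in for real stripper output.

```java
import java.util.ArrayList;
import java.util.List;

class TableRowSplitter {

    /**
     * Splits layout-preserved text into rows of cells.
     * Assumption: columns are separated by at least two whitespace characters,
     * while words inside a cell are separated by a single space.
     */
    static List<String[]> splitRows(String layoutText) {
        List<String[]> rows = new ArrayList<>();
        for (String line : layoutText.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines between rows
            }
            rows.add(trimmed.split("\\s{2,}"));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by pdfTextStripper.getText(pdDocument).
        String sample = "Stop      Time\nMain St   08:15\n";
        for (String[] cells : splitRows(sample)) {
            System.out.println(String.join(" | ", cells));
        }
    }
}
```

This approach is deliberately naive: an empty cell in the middle of a row merges into the surrounding gap and shifts the remaining cells left. For tables with empty cells, slicing each line at fixed column offsets is likely to be more reliable.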
Thanks to
- Dmytro Zelinskyy for reporting an issue together with its fix (v2.2.3)
- Ho Ting Cheng for reporting an issue (v2.1)
- James Sullivan for having updated the code to make it work with the latest version of PDFBox (v2.0)