Java utility for parsing PDF tabular data using Apache PDFBox and OpenCV
rostrovsky/pdf-table
PDF-table is a Java utility library for parsing tabular data in PDF documents.
Core processing of PDF documents is performed with Apache PDFBox and OpenCV.
pdf-table requires a compiled build of OpenCV 3.4.2 to work properly:
Download OpenCV v3.4.2 from https://github.com/opencv/opencv/releases/tag/3.4.2
Unpack it and add the following directory to your system PATH:
Windows:
<opencv dir>\build\java\x64
Linux:
TODO
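If the native library cannot be found at runtime, a quick lookup can help confirm whether the PATH change took effect. The following is a minimal sketch, not part of pdf-table: the class name `OpenCvPathCheck` and the helper `findOnPath` are hypothetical, and it assumes the OpenCV 3.4.2 Java binding is named `opencv_java342` and that the JVM resolves native libraries through `java.library.path` (which on Windows normally includes the PATH entries).

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Pattern;

class OpenCvPathCheck {

    /** Returns the first directory entry in pathList that contains fileName, if any. */
    static Optional<Path> findOnPath(String pathList, String fileName) {
        if (pathList == null || pathList.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathList.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            try {
                Path candidate = Paths.get(dir, fileName);
                if (Files.isRegularFile(candidate)) {
                    return Optional.of(candidate);
                }
            } catch (InvalidPathException e) {
                // skip malformed entries instead of failing the whole lookup
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumption: the OpenCV 3.4.2 Java binding is called "opencv_java342";
        // mapLibraryName turns that into the platform-specific file name.
        String fileName = System.mapLibraryName("opencv_java342");
        Optional<Path> hit = findOnPath(System.getProperty("java.library.path"), fileName);
        System.out.println(hit.map(p -> "Found " + p)
                .orElse(fileName + " not found on java.library.path"));
    }
}
```

Running `main` after editing PATH (and reopening the terminal or IDE) shows whether the directory that holds the native library is visible to the JVM.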
    <dependency>
        <groupId>com.github.rostrovsky</groupId>
        <artifactId>pdf-table</artifactId>
        <version>1.0.0</version>
    </dependency>
When a PDF document page is parsed, the following operations are performed:
1. The page is converted to a grayscale image [OpenCV].
2. A Binary Inverted Threshold (BIT) is applied to the grayscale image [OpenCV].
3. Contours are detected on the BIT image and a contour mask is created (additional Canny filtering can be turned on in this step) [OpenCV].
4. The contour mask is XORed with the BIT image [OpenCV].
5. Contours are detected once again on the XORed image (additional Canny filtering can be turned on in this step) [OpenCV].
6. The final contours are drawn [OpenCV].
7. Bounding rectangles are detected from the final contours [OpenCV].
8. The PDF is parsed region by region using the coordinates of the bounding rectangles [Apache PDFBox].
The above algorithm is mostly derived from http://stackoverflow.com/a/23106594.
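To make the threshold and XOR steps more concrete, here is a toy walk-through in plain Java on a tiny hand-made "page". It is only an illustration under simplifying assumptions, not the library's code: the class `TableMaskSketch` and all of its methods are hypothetical, the contour mask is approximated by filling the bounding box of the foreground, and contour detection is replaced by a simple 4-connected flood fill. The real implementation relies on OpenCV for all of these steps.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class TableMaskSketch {

    /** Binary inverted threshold: pixels above thresh become 0, all others 255. */
    static int[][] binaryInvertedThreshold(int[][] gray, int thresh) {
        int[][] out = new int[gray.length][gray[0].length];
        for (int y = 0; y < gray.length; y++)
            for (int x = 0; x < gray[0].length; x++)
                out[y][x] = gray[y][x] > thresh ? 0 : 255;
        return out;
    }

    /** Simplified stand-in for a filled contour mask: fills the foreground's bounding box. */
    static int[][] filledMask(int[][] bin) {
        int h = bin.length, w = bin[0].length;
        int minX = w, minY = h, maxX = -1, maxY = -1;
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                if (bin[y][x] != 0) {
                    minX = Math.min(minX, x); maxX = Math.max(maxX, x);
                    minY = Math.min(minY, y); maxY = Math.max(maxY, y);
                }
        int[][] mask = new int[h][w];
        for (int y = minY; y <= maxY; y++)
            for (int x = minX; x <= maxX; x++)
                mask[y][x] = 255;
        return mask;
    }

    /** Pixel-wise XOR of two images of equal size. */
    static int[][] xor(int[][] a, int[][] b) {
        int[][] out = new int[a.length][a[0].length];
        for (int y = 0; y < a.length; y++)
            for (int x = 0; x < a[0].length; x++)
                out[y][x] = a[y][x] ^ b[y][x];
        return out;
    }

    /** Bounding rectangles {x, y, width, height} of 4-connected foreground blobs. */
    static List<int[]> boundingRects(int[][] img) {
        int h = img.length, w = img[0].length;
        boolean[][] seen = new boolean[h][w];
        List<int[]> rects = new ArrayList<>();
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (img[y][x] == 0 || seen[y][x]) continue;
                int minX = x, maxX = x, minY = y, maxY = y;
                Deque<int[]> stack = new ArrayDeque<>();
                stack.push(new int[] {x, y});
                seen[y][x] = true;
                while (!stack.isEmpty()) {
                    int[] p = stack.pop();
                    minX = Math.min(minX, p[0]); maxX = Math.max(maxX, p[0]);
                    minY = Math.min(minY, p[1]); maxY = Math.max(maxY, p[1]);
                    int[][] next = {{p[0] + 1, p[1]}, {p[0] - 1, p[1]}, {p[0], p[1] + 1}, {p[0], p[1] - 1}};
                    for (int[] n : next) {
                        boolean inside = n[0] >= 0 && n[0] < w && n[1] >= 0 && n[1] < h;
                        if (inside && img[n[1]][n[0]] != 0 && !seen[n[1]][n[0]]) {
                            seen[n[1]][n[0]] = true;
                            stack.push(n);
                        }
                    }
                }
                rects.add(new int[] {minX, minY, maxX - minX + 1, maxY - minY + 1});
            }
        }
        return rects;
    }

    /** 9x6 grayscale page: white margin, dark table lines, two white 2x2 cells. */
    static int[][] toyPage() {
        int W = 255, B = 0;
        return new int[][] {
            {W, W, W, W, W, W, W, W, W},
            {W, B, B, B, B, B, B, B, W},
            {W, B, W, W, B, W, W, B, W},
            {W, B, W, W, B, W, W, B, W},
            {W, B, B, B, B, B, B, B, W},
            {W, W, W, W, W, W, W, W, W},
        };
    }

    public static void main(String[] args) {
        int[][] bit = binaryInvertedThreshold(toyPage(), 127);
        int[][] cells = xor(filledMask(bit), bit); // table lines cancel out, cell interiors remain
        for (int[] r : boundingRects(cells)) {
            System.out.println("x=" + r[0] + " y=" + r[1] + " w=" + r[2] + " h=" + r[3]);
        }
    }
}
```

On the toy page this prints `x=2 y=2 w=2 h=2` and `x=5 y=2 w=2 h=2`, one rectangle per cell. The point of the XOR is visible here: the table lines are set in both the BIT image and the mask, so they cancel, while the cell interiors are set only in the mask and survive as separate blobs.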
For more information about parsed output, refer to Output format.
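    import java.io.File;
    import java.io.IOException;
    import java.util.List;
    import org.apache.pdfbox.pdmodel.PDDocument;
    // PdfTableReader and ParsedTablePage come from pdf-table

    class SingleThreadParser {
        public static void main(String[] args) throws IOException {
            PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
            PdfTableReader reader = new PdfTableReader();
            // page numbers are 1-based
            List<ParsedTablePage> parsed = reader.parsePdfTablePages(pdfDoc, 1, pdfDoc.getNumberOfPages());
        }
    }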
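    import java.io.File;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;
    import org.apache.pdfbox.pdmodel.PDDocument;
    // PdfTableReader and ParsedTablePage come from pdf-table

    class MultiThreadParser {
        public static void main(String[] args) throws IOException {
            final int THREAD_COUNT = 8;
            PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
            PdfTableReader reader = new PdfTableReader();

            // parse pages simultaneously
            ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
            List<Future<ParsedTablePage>> futures = new ArrayList<>();
            for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
                Callable<ParsedTablePage> callable = () -> {
                    ParsedTablePage page = reader.parsePdfTablePage(pdfDoc, pageNum);
                    return page;
                };
                futures.add(executor.submit(callable));
            }

            // collect parsed pages
            List<ParsedTablePage> unsortedParsedPages = new ArrayList<>(pdfDoc.getNumberOfPages());
            try {
                for (Future<ParsedTablePage> f : futures) {
                    ParsedTablePage page = f.get();
                    unsortedParsedPages.add(page.getPageNum() - 1, page);
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            } finally {
                // release the worker threads so the JVM can exit
                executor.shutdown();
            }

            // sort pages by pageNum
            List<ParsedTablePage> sortedParsedPages = unsortedParsedPages.stream()
                    .sorted((p1, p2) -> Integer.compare(p1.getPageNum(), p2.getPageNum()))
                    .collect(Collectors.toList());
        }
    }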
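The fan-out pattern used in the multi-threaded example can be factored into a small reusable helper. The sketch below uses only the JDK and is not part of pdf-table: `OrderedPageFanOut`, `processPages` and the `worker` function are hypothetical names, and `worker` stands in for a call such as `reader.parsePdfTablePage(pdfDoc, pageNum)`. It relies on the fact that futures are read back in submission order, so the results come out ordered by page number regardless of which task finishes first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntFunction;

class OrderedPageFanOut {

    /**
     * Runs worker for pages 1..pageCount on a fixed thread pool and returns
     * the results in page order.
     */
    static <T> List<T> processPages(int pageCount, int threads, IntFunction<T> worker) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (int pageNum = 1; pageNum <= pageCount; pageNum++) {
                final int n = pageNum; // lambdas need an effectively final copy
                Callable<T> task = () -> worker.apply(n);
                futures.add(executor.submit(task));
            }
            // futures are read in submission order, so results stay in page order
            List<T> results = new ArrayList<>(pageCount);
            for (Future<T> f : futures) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for pages", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("page processing failed", e.getCause());
        } finally {
            executor.shutdown(); // let the worker threads terminate
        }
    }

    public static void main(String[] args) {
        // stand-in worker: a real one would parse the given page
        List<String> labels = processPages(5, 3, n -> "page-" + n);
        System.out.println(labels);
    }
}
```

With the stand-in worker above, `main` prints `[page-1, page-2, page-3, page-4, page-5]`. Whether a shared `PDDocument` and reader can safely be used from several threads depends on the libraries involved, so that part is left to the caller here.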
PDF-Table provides methods for saving PDF pages as PNG images.
Rendering DPI can be modified in PdfTableSettings (see: Parsing settings).
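    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import org.apache.pdfbox.pdmodel.PDDocument;
    // PdfTableReader comes from pdf-table

    class SingleThreadPNGDump {
        public static void main(String[] args) throws IOException {
            PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
            Path outputPath = Paths.get("C:", "some_directory");
            PdfTableReader reader = new PdfTableReader();
            reader.savePdfPagesAsPNG(pdfDoc, 1, pdfDoc.getNumberOfPages(), outputPath);
        }
    }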
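    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.stream.IntStream;
    import org.apache.pdfbox.pdmodel.PDDocument;
    // PdfTableReader comes from pdf-table

    class MultiThreadPNGDump {
        public static void main(String[] args) throws IOException {
            final int THREAD_COUNT = 8;
            Path outputPath = Paths.get("C:", "some_directory");
            PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
            PdfTableReader reader = new PdfTableReader();

            // save pages simultaneously
            ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
            List<Future<Boolean>> futures = new ArrayList<>();
            for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
                Callable<Boolean> callable = () -> {
                    reader.savePdfPageAsPNG(pdfDoc, pageNum, outputPath);
                    return true;
                };
                futures.add(executor.submit(callable));
            }

            // wait for all pages to be written
            try {
                for (Future<Boolean> f : futures) {
                    f.get();
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            } finally {
                // release the worker threads so the JVM can exit
                executor.shutdown();
            }
        }
    }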
When tables in a PDF document cannot be parsed correctly with the default settings, the user can save debug images that show the page at various stages of processing.
Using these images, the user can adjust PdfTableSettings accordingly to achieve the desired results (see: Parsing settings).
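    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import org.apache.pdfbox.pdmodel.PDDocument;
    // PdfTableReader comes from pdf-table

    class SingleThreadDebugImgsDump {
        public static void main(String[] args) throws IOException {
            PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
            Path outputPath = Paths.get("C:", "some_directory");
            PdfTableReader reader = new PdfTableReader();
            reader.savePdfTablePagesDebugImages(pdfDoc, 1, pdfDoc.getNumberOfPages(), outputPath);
        }
    }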
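    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.stream.IntStream;
    import org.apache.pdfbox.pdmodel.PDDocument;
    // PdfTableReader comes from pdf-table

    class MultiThreadDebugImgsDump {
        public static void main(String[] args) throws IOException {
            final int THREAD_COUNT = 8;
            Path outputPath = Paths.get("C:", "some_directory");
            PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
            PdfTableReader reader = new PdfTableReader();

            // save debug images simultaneously
            ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
            List<Future<Boolean>> futures = new ArrayList<>();
            for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
                Callable<Boolean> callable = () -> {
                    reader.savePdfTablePagesDebugImage(pdfDoc, pageNum, outputPath);
                    return true;
                };
                futures.add(executor.submit(callable));
            }

            // wait for all images to be written
            try {
                for (Future<Boolean> f : futures) {
                    f.get();
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            } finally {
                // release the worker threads so the JVM can exit
                executor.shutdown();
            }
        }
    }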
PDF rendering and OpenCV filtering settings are stored in a PdfTableSettings object.
A custom settings instance can be passed to the PdfTableReader constructor when non-default values are needed:
    (...)
    // build settings object
    PdfTableSettings settings = PdfTableSettings.getBuilder()
            .setCannyFiltering(true)
            .setCannyApertureSize(5)
            .setCannyThreshold1(40)
            .setCannyThreshold2(190.5)
            .setPdfRenderingDpi(160)
            .build();

    // pass settings to reader
    PdfTableReader reader = new PdfTableReader(settings);
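When tuning the rendering DPI it helps to know roughly how large the rendered page image becomes, since that affects both the OpenCV processing time and the size of saved PNG files. PDF page dimensions are expressed in points (1/72 inch), so the pixel size scales by `dpi / 72`. The helper below is a hypothetical illustration (`RenderSize` and `pixels` are not part of pdf-table), and the exact rounding performed by the real renderer is an assumption and may differ by a pixel.

```java
class RenderSize {

    /** Approximate pixel length of a page side given in PDF points at the given DPI. */
    static int pixels(float points, float dpi) {
        // 72 points per inch; flooring is an assumption about the renderer's rounding
        return (int) Math.floor(points * dpi / 72f);
    }

    public static void main(String[] args) {
        // US Letter is 612 x 792 points
        System.out.println(pixels(612f, 160f) + " x " + pixels(792f, 160f));
    }
}
```

For a US Letter page at 160 DPI this gives 1360 x 1760 pixels; doubling the DPI roughly quadruples the number of pixels to process.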
Each parsed PDF page is returned as a ParsedTablePage object:
    (...)
    PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
    PdfTableReader reader = new PdfTableReader();

    // first page in document has index == 1, not 0 !
    ParsedTablePage firstPage = reader.parsePdfTablePage(pdfDoc, 1);

    // getting page number
    assert firstPage.getPageNum() == 1;

    // rows and cells are zero-indexed just like elements of the List
    // getting first row
    ParsedTablePage.ParsedTableRow firstRow = firstPage.getRow(0);

    // getting third cell in second row
    String thirdCellContent = firstPage.getRow(1).getCell(2);

    // cell content usually contains <CR><LF> characters,
    // so it is recommended to trim them before processing
    double thirdCellNumericValue = Double.parseDouble(thirdCellContent.trim());
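Since cell content is returned as raw text, numeric cells need some defensive handling: a cell may be empty or hold a label instead of a number. The following is a minimal sketch of such a helper using only the JDK. The class `CellText` and its methods are hypothetical and not part of pdf-table; the input strings stand in for values returned by `getCell`.

```java
import java.util.OptionalDouble;

class CellText {

    /** Strips surrounding whitespace, including trailing <CR><LF>; null becomes "". */
    static String clean(String raw) {
        return raw == null ? "" : raw.trim();
    }

    /** Parses a cell as a double, or returns empty if the cell is blank or not numeric. */
    static OptionalDouble toDouble(String raw) {
        String text = clean(raw);
        if (text.isEmpty()) {
            return OptionalDouble.empty();
        }
        try {
            return OptionalDouble.of(Double.parseDouble(text));
        } catch (NumberFormatException e) {
            return OptionalDouble.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(toDouble("12.5\r\n"));  // numeric cell
        System.out.println(toDouble("n/a\r\n"));   // non-numeric cell
    }
}
```

Returning `OptionalDouble` instead of throwing keeps a single malformed cell from aborting the processing of a whole table; callers decide whether to skip, default, or report such cells.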