Java utility for parsing PDF tabular data using Apache PDFBox and OpenCV

License

MIT license


PDF-table

What is PDF-table?

PDF-table is a Java utility library for parsing tabular data in PDF documents.
Core processing of PDF documents is performed using Apache PDFBox and OpenCV.

Prerequisites

JDK

Java 8 is required.

External dependencies

pdf-table requires compiled OpenCV 3.4.2 to work properly:

1. Download OpenCV v3.4.2 from https://github.com/opencv/opencv/releases/tag/3.4.2
2. Unpack it and add it to your system PATH:
- Windows: <opencv dir>\build\java\x64
- Linux: TODO
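
A quick way to confirm that the PATH entry is set up is to look for the native library file in each PATH directory. The sketch below is not part of pdf-table. It assumes that the OpenCV 3.4.2 Java bindings load a native library named `opencv_java342` (the usual OpenCV naming scheme), and `OpenCvPathCheck` and `findLibraryDir` are hypothetical names introduced for illustration.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

final class OpenCvPathCheck {

    // Assumption: OpenCV 3.4.2 ships its Java native library under this name
    // (for example opencv_java342.dll on Windows).
    static final String ASSUMED_LIBRARY_NAME = "opencv_java342";

    /** Returns the first directory in pathValue that contains the platform-specific library file. */
    static Optional<Path> findLibraryDir(String pathValue, String libraryName) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        // Maps "name" to "name.dll", "libname.so", etc. depending on the platform.
        String fileName = System.mapLibraryName(libraryName);
        for (String entry : pathValue.split(File.pathSeparator)) {
            if (entry.isEmpty()) {
                continue;
            }
            try {
                Path dir = Paths.get(entry);
                if (Files.isRegularFile(dir.resolve(fileName))) {
                    return Optional.of(dir);
                }
            } catch (InvalidPathException e) {
                // Skip PATH entries that are not valid paths on this platform.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> dir = findLibraryDir(System.getenv("PATH"), ASSUMED_LIBRARY_NAME);
        System.out.println(dir.map(d -> "Found OpenCV native library in " + d)
                .orElse("OpenCV native library not found on PATH"));
    }
}
```

If the check reports that nothing was found, the directory added to PATH most likely does not contain the compiled Java bindings.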

Installation

<dependency>
  <groupId>com.github.rostrovsky</groupId>
  <artifactId>pdf-table</artifactId>
  <version>1.0.0</version>
</dependency>

Usage

Parsing PDFs

When a PDF document page is parsed, the following operations are performed:

1. The page is converted to a grayscale image [OpenCV].
2. Binary Inverted Threshold (BIT) is applied to the grayscale image [OpenCV].
3. Contours are detected on the BIT image and a contour mask is created (additional Canny filtering can be turned on in this step) [OpenCV].
4. The contour mask is XORed with the BIT image [OpenCV].
5. Contours are detected once again on the XORed image (additional Canny filtering can be turned on in this step) [OpenCV].
6. Final contours are drawn [OpenCV].
7. Bounding rectangles are detected from the final contours [OpenCV].
8. The PDF is parsed region by region using the coordinates of the bounding rectangles [Apache PDFBox].

The above algorithm is mostly derived from http://stackoverflow.com/a/23106594.

For more information about the parsed output, refer to Output format.
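
To make the thresholding and XOR operations listed above more concrete, here is a minimal plain-Java sketch that applies them to a tiny grayscale matrix. It is an illustration only: pdf-table performs these steps with OpenCV, and the class name, method names, and threshold value here are assumptions made for the example.

```java
final class ThresholdXorSketch {

    /** Binary inverted threshold: pixels brighter than thresh become 0, all others become maxValue. */
    static int[][] binaryInvertedThreshold(int[][] gray, int thresh, int maxValue) {
        int[][] out = new int[gray.length][];
        for (int y = 0; y < gray.length; y++) {
            out[y] = new int[gray[y].length];
            for (int x = 0; x < gray[y].length; x++) {
                out[y][x] = gray[y][x] > thresh ? 0 : maxValue;
            }
        }
        return out;
    }

    /** Pixel-wise XOR of two images with identical dimensions. */
    static int[][] xor(int[][] a, int[][] b) {
        int[][] out = new int[a.length][];
        for (int y = 0; y < a.length; y++) {
            out[y] = new int[a[y].length];
            for (int x = 0; x < a[y].length; x++) {
                out[y][x] = a[y][x] ^ b[y][x];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Bright background (255, 200) and dark ink (10, 100) on a 2x2 "page".
        int[][] gray = {{255, 10}, {200, 100}};
        int[][] bit = binaryInvertedThreshold(gray, 150, 255);
        // Hypothetical contour mask; in the real pipeline it is drawn from detected contours.
        int[][] mask = {{0, 255}, {255, 0}};
        int[][] xored = xor(mask, bit);
        System.out.println(java.util.Arrays.deepToString(bit));
        System.out.println(java.util.Arrays.deepToString(xored));
    }
}
```

After the inverted threshold, dark ink is white (255) and the bright background is black (0). The XOR keeps only the pixels where the mask and the thresholded image differ.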

single-threaded example

class SingleThreadParser {
    public static void main(String[] args) throws IOException {
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        PdfTableReader reader = new PdfTableReader();
        List<ParsedTablePage> parsed = reader.parsePdfTablePages(pdfDoc, 1, pdfDoc.getNumberOfPages());
    }
}

multi-threaded example

class MultiThreadParser {
    public static void main(String[] args) throws IOException {
        final int THREAD_COUNT = 8;
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        PdfTableReader reader = new PdfTableReader();

        // parse pages simultaneously
        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
        List<Future<ParsedTablePage>> futures = new ArrayList<>();
        for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
            Callable<ParsedTablePage> callable = () -> {
                ParsedTablePage page = reader.parsePdfTablePage(pdfDoc, pageNum);
                return page;
            };
            futures.add(executor.submit(callable));
        }

        // collect parsed pages
        List<ParsedTablePage> unsortedParsedPages = new ArrayList<>(pdfDoc.getNumberOfPages());
        try {
            for (Future<ParsedTablePage> f : futures) {
                ParsedTablePage page = f.get();
                unsortedParsedPages.add(page.getPageNum() - 1, page);
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // shut the pool down so its worker threads do not keep the JVM alive
            executor.shutdown();
        }

        // sort pages by pageNum
        List<ParsedTablePage> sortedParsedPages = unsortedParsedPages.stream()
                .sorted((p1, p2) -> Integer.compare(p1.getPageNum(), p2.getPageNum()))
                .collect(Collectors.toList());
    }
}
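
The submit-then-collect pattern used in the multi-threaded example can be factored into a small reusable helper. The sketch below uses only the JDK. `ParallelPages`, `PageTask`, and `processPages` are hypothetical names and are not part of pdf-table. Because futures are read back in submission order, the results come out in page order without a separate sort.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelPages {

    /** A per-page task that may throw checked exceptions (hypothetical helper interface). */
    interface PageTask<T> {
        T run(int pageNum) throws Exception;
    }

    /**
     * Runs pageTask for every page number in [1, pageCount] on a fixed thread pool
     * and returns the results in page order.
     */
    static <T> List<T> processPages(int pageCount, int threadCount, PageTask<T> pageTask)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newFixedThreadPool(threadCount);
        try {
            List<Future<T>> futures = new ArrayList<>(pageCount);
            for (int pageNum = 1; pageNum <= pageCount; pageNum++) {
                final int n = pageNum; // page numbers are 1-based
                Callable<T> callable = () -> pageTask.run(n);
                futures.add(executor.submit(callable));
            }
            // Futures are read in submission order, so results are already sorted by page.
            List<T> results = new ArrayList<>(pageCount);
            for (Future<T> f : futures) {
                results.add(f.get());
            }
            return results;
        } finally {
            // Always release the worker threads, even if a task failed.
            executor.shutdown();
        }
    }
}
```

With such a helper, the parsing example could be reduced to a call like `processPages(pdfDoc.getNumberOfPages(), 8, n -> reader.parsePdfTablePage(pdfDoc, n))`.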

Saving PDF pages as PNG images

PDF-table provides methods for saving PDF pages as PNG images.
Rendering DPI can be modified in PdfTableSettings (see: Parsing settings).
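
As a rough guide to how the DPI setting affects image size: PDF page dimensions are expressed in points, with 72 points per inch, so the rendered size in pixels is approximately points × DPI / 72. The sketch below only illustrates that arithmetic. `DpiMath` is a hypothetical helper, and the exact rounding applied by the renderer may differ by a pixel.

```java
final class DpiMath {

    // PDF user space uses 72 points per inch by default.
    static final double POINTS_PER_INCH = 72.0;

    /** Approximate pixel length of a page dimension given in points at the chosen DPI. */
    static int pointsToPixels(double points, int dpi) {
        return (int) Math.round(points * dpi / POINTS_PER_INCH);
    }

    public static void main(String[] args) {
        // US Letter is 612 x 792 points.
        System.out.println(pointsToPixels(612, 300) + " x " + pointsToPixels(792, 300));
    }
}
```

Higher DPI values give the contour detection more pixels to work with, at the cost of larger images and longer processing.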

single-threaded example

class SingleThreadPNGDump {
    public static void main(String[] args) throws IOException {
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        Path outputPath = Paths.get("C:", "some_directory");
        PdfTableReader reader = new PdfTableReader();
        reader.savePdfPagesAsPNG(pdfDoc, 1, pdfDoc.getNumberOfPages(), outputPath);
    }
}

multi-threaded example

class MultiThreadPNGDump {
    public static void main(String[] args) throws IOException {
        final int THREAD_COUNT = 8;
        Path outputPath = Paths.get("C:", "some_directory");
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        PdfTableReader reader = new PdfTableReader();

        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
        List<Future<Boolean>> futures = new ArrayList<>();
        for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
            Callable<Boolean> callable = () -> {
                reader.savePdfPageAsPNG(pdfDoc, pageNum, outputPath);
                return true;
            };
            futures.add(executor.submit(callable));
        }

        try {
            for (Future<Boolean> f : futures) {
                f.get();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // shut the pool down so its worker threads do not keep the JVM alive
            executor.shutdown();
        }
    }
}

Saving debug PNG images

When tables in a PDF document cannot be parsed correctly with the default settings, the user can save debug images that show the page at various stages of processing.
Using these images, the user can adjust PdfTableSettings to achieve the desired results (see: Parsing settings).

single-threaded example

class SingleThreadDebugImgsDump {
    public static void main(String[] args) throws IOException {
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        Path outputPath = Paths.get("C:", "some_directory");
        PdfTableReader reader = new PdfTableReader();
        reader.savePdfTablePagesDebugImages(pdfDoc, 1, pdfDoc.getNumberOfPages(), outputPath);
    }
}

multi-threaded example

class MultiThreadDebugImgsDump {
    public static void main(String[] args) throws IOException {
        final int THREAD_COUNT = 8;
        Path outputPath = Paths.get("C:", "some_directory");
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        PdfTableReader reader = new PdfTableReader();

        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
        List<Future<Boolean>> futures = new ArrayList<>();
        for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
            Callable<Boolean> callable = () -> {
                reader.savePdfTablePagesDebugImage(pdfDoc, pageNum, outputPath);
                return true;
            };
            futures.add(executor.submit(callable));
        }

        try {
            for (Future<Boolean> f : futures) {
                f.get();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // shut the pool down so its worker threads do not keep the JVM alive
            executor.shutdown();
        }
    }
}

Parsing settings

PDF rendering and OpenCV filtering settings are stored in a PdfTableSettings object.

A custom settings instance can be passed to the PdfTableReader constructor when non-default values are needed:

(...)
// build settings object
PdfTableSettings settings = PdfTableSettings.getBuilder()
        .setCannyFiltering(true)
        .setCannyApertureSize(5)
        .setCannyThreshold1(40)
        .setCannyThreshold2(190.5)
        .setPdfRenderingDpi(160)
        .build();

// pass settings to reader
PdfTableReader reader = new PdfTableReader(settings);

Output format

Each parsed PDF page is returned as a ParsedTablePage object:

(...)
PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
PdfTableReader reader = new PdfTableReader();

// the first page in the document has index 1, not 0!
ParsedTablePage firstPage = reader.parsePdfTablePage(pdfDoc, 1);

// getting page number
assert firstPage.getPageNum() == 1;

// rows and cells are zero-indexed, just like elements of a List
// getting first row
ParsedTablePage.ParsedTableRow firstRow = firstPage.getRow(0);

// getting third cell in second row
String thirdCellContent = firstPage.getRow(1).getCell(2);

// cell content usually contains <CR><LF> characters,
// so it is recommended to trim them before processing
double thirdCellNumericValue = Double.parseDouble(thirdCellContent.trim());
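
Table cells are not always numeric, so a direct conversion can throw a NumberFormatException on headers or empty cells. The following is a minimal sketch of a more defensive conversion. `CellValues` and `parseNumericCell` are hypothetical names and are not part of pdf-table.

```java
import java.util.OptionalDouble;

final class CellValues {

    /** Trims whitespace (including CR/LF) and parses the cell as a double, or returns empty. */
    static OptionalDouble parseNumericCell(String cell) {
        if (cell == null) {
            return OptionalDouble.empty();
        }
        String trimmed = cell.trim();
        if (trimmed.isEmpty()) {
            return OptionalDouble.empty();
        }
        try {
            return OptionalDouble.of(Double.parseDouble(trimmed));
        } catch (NumberFormatException e) {
            // Not a number, for example a header cell or a placeholder such as "n/a".
            return OptionalDouble.empty();
        }
    }
}
```

A caller can then write `CellValues.parseNumericCell(firstPage.getRow(1).getCell(2)).orElse(0.0)`, or skip cells for which the result is empty.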
