Java utility for parsing PDF tabular data using Apache PDFBox and OpenCV

rostrovsky/pdf-table


What is PDF-table?

PDF-table is a Java utility library for parsing tabular data in PDF documents.
Core processing of PDF documents is performed using Apache PDFBox and OpenCV.

Prerequisites

JDK

Java 8 is required.

External dependencies

pdf-table requires a compiled OpenCV 3.4.2 to work properly:

  1. Download OpenCV v3.4.2 from https://github.com/opencv/opencv/releases/tag/3.4.2

  2. Unpack it and add the following directory to your system PATH:

    • Windows: <opencv dir>\build\java\x64

    • Linux: TODO
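
A quick way to check the setup above is to look for the OpenCV native library on the JVM's library search path. The following is a minimal sketch, not part of pdf-table: the class and method names are hypothetical, and it assumes the OpenCV 3.4.2 Java binding is named `opencv_java342` and that `java.library.path` reflects your system PATH (typically the case on Windows).

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

class OpenCvPathCheck {

    // Assumption: base name of the native library shipped with OpenCV 3.4.2.
    static final String LIB_NAME = "opencv_java342";

    // Returns the first file on searchPath whose name matches the
    // platform-specific file name of libName (e.g. opencv_java342.dll).
    static Optional<Path> findNativeLibrary(String libName, String searchPath) {
        String fileName = System.mapLibraryName(libName);
        for (String dir : searchPath.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            Path candidate = Paths.get(dir, fileName);
            if (Files.isRegularFile(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String searchPath = System.getProperty("java.library.path", "");
        Optional<Path> lib = findNativeLibrary(LIB_NAME, searchPath);
        System.out.println(lib.map(p -> "Found: " + p)
                .orElse("Not found: " + System.mapLibraryName(LIB_NAME)));
    }
}
```

If the library is reported as not found, the directory from step 2 is probably missing from the search path.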

Installation

    <dependency>
      <groupId>com.github.rostrovsky</groupId>
      <artifactId>pdf-table</artifactId>
      <version>1.0.0</version>
    </dependency>

Usage

Parsing PDFs

When a PDF document page is parsed, the following operations are performed:

  1. The page is converted to a grayscale image [OpenCV].

  2. A Binary Inverted Threshold (BIT) is applied to the grayscale image [OpenCV].

  3. Contours are detected on the BIT image and a contour mask is created (additional Canny filtering can be turned on in this step) [OpenCV].

  4. The contour mask is XORed with the BIT image [OpenCV].

  5. Contours are detected once again on the XORed image (additional Canny filtering can be turned on in this step) [OpenCV].

  6. The final contours are drawn [OpenCV].

  7. Bounding rectangles are detected from the final contours [OpenCV].

  8. The PDF is parsed region by region using the coordinates of the bounding rectangles [Apache PDFBox].

The above algorithm is mostly derived from http://stackoverflow.com/a/23106594.

For more information about the parsed output, refer to Output format.
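
The image-processing steps above can be illustrated without OpenCV on a tiny hand-made image. The following is a minimal plain-Java sketch, not the library's code: all names are hypothetical, the contour mask is approximated by the filled bounding box of all foreground pixels, and contour detection is replaced by a flood fill over 4-connected white regions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class TableCellSketch {

    // Step 2: dark pixels (value <= thresh) become 255, all others 0.
    static int[][] binaryInvertedThreshold(int[][] gray, int thresh) {
        int[][] out = new int[gray.length][gray[0].length];
        for (int y = 0; y < gray.length; y++)
            for (int x = 0; x < gray[0].length; x++)
                out[y][x] = gray[y][x] > thresh ? 0 : 255;
        return out;
    }

    // Step 3 (simplified): fill the bounding box of all foreground pixels.
    static int[][] filledMask(int[][] bin) {
        int h = bin.length, w = bin[0].length;
        int minX = w, minY = h, maxX = -1, maxY = -1;
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                if (bin[y][x] != 0) {
                    minX = Math.min(minX, x); maxX = Math.max(maxX, x);
                    minY = Math.min(minY, y); maxY = Math.max(maxY, y);
                }
        int[][] mask = new int[h][w];
        for (int y = minY; y <= maxY; y++)
            for (int x = minX; x <= maxX; x++)
                mask[y][x] = 255;
        return mask;
    }

    // Step 4: pixel-wise XOR; table lines vanish, cell interiors turn white.
    static int[][] xor(int[][] a, int[][] b) {
        int[][] out = new int[a.length][a[0].length];
        for (int y = 0; y < a.length; y++)
            for (int x = 0; x < a[0].length; x++)
                out[y][x] = a[y][x] ^ b[y][x];
        return out;
    }

    // Steps 5-7 (simplified): {x, y, width, height} of each white region.
    static List<int[]> boundingRectangles(int[][] bin) {
        int h = bin.length, w = bin[0].length;
        boolean[][] seen = new boolean[h][w];
        List<int[]> rects = new ArrayList<>();
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (bin[y][x] == 0 || seen[y][x]) continue;
                int minX = x, maxX = x, minY = y, maxY = y;
                Deque<int[]> stack = new ArrayDeque<>();
                stack.push(new int[] {x, y});
                seen[y][x] = true;
                while (!stack.isEmpty()) {
                    int[] p = stack.pop();
                    minX = Math.min(minX, p[0]); maxX = Math.max(maxX, p[0]);
                    minY = Math.min(minY, p[1]); maxY = Math.max(maxY, p[1]);
                    int[][] next = {{p[0] + 1, p[1]}, {p[0] - 1, p[1]},
                                    {p[0], p[1] + 1}, {p[0], p[1] - 1}};
                    for (int[] n : next) {
                        if (n[0] >= 0 && n[0] < w && n[1] >= 0 && n[1] < h
                                && bin[n[1]][n[0]] != 0 && !seen[n[1]][n[0]]) {
                            seen[n[1]][n[0]] = true;
                            stack.push(n);
                        }
                    }
                }
                rects.add(new int[] {minX, minY, maxX - minX + 1, maxY - minY + 1});
            }
        }
        return rects;
    }

    static List<int[]> detectCells(int[][] gray, int thresh) {
        int[][] bit = binaryInvertedThreshold(gray, thresh);
        return boundingRectangles(xor(filledMask(bit), bit));
    }

    // 7x5 white image with black lines forming a table of two cells.
    static int[][] sampleTable() {
        int[][] gray = new int[5][7];
        for (int y = 0; y < 5; y++)
            for (int x = 0; x < 7; x++)
                gray[y][x] = (y == 0 || y == 4 || x == 0 || x == 3 || x == 6) ? 0 : 255;
        return gray;
    }
}
```

For the sample image, `detectCells(sampleTable(), 127)` yields two rectangles, one per cell interior. In the library, rectangles like these are the regions handed to PDFBox in step 8.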

single-threaded example

    class SingleThreadParser {
        public static void main(String[] args) throws IOException {
            PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
            PdfTableReader reader = new PdfTableReader();
            List<ParsedTablePage> parsed = reader.parsePdfTablePages(pdfDoc, 1, pdfDoc.getNumberOfPages());
        }
    }

multi-threaded example

    class MultiThreadParser {
        public static void main(String[] args) throws IOException {
            final int THREAD_COUNT = 8;
            PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
            PdfTableReader reader = new PdfTableReader();

            // parse pages simultaneously
            ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
            List<Future<ParsedTablePage>> futures = new ArrayList<>();
            for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
                Callable<ParsedTablePage> callable = () -> {
                    ParsedTablePage page = reader.parsePdfTablePage(pdfDoc, pageNum);
                    return page;
                };
                futures.add(executor.submit(callable));
            }

            // collect parsed pages
            List<ParsedTablePage> unsortedParsedPages = new ArrayList<>(pdfDoc.getNumberOfPages());
            try {
                for (Future<ParsedTablePage> f : futures) {
                    ParsedTablePage page = f.get();
                    unsortedParsedPages.add(page.getPageNum() - 1, page);
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            } finally {
                // release the worker threads so the JVM can exit
                executor.shutdown();
            }

            // sort pages by pageNum
            List<ParsedTablePage> sortedParsedPages = unsortedParsedPages.stream()
                    .sorted((p1, p2) -> Integer.compare(p1.getPageNum(), p2.getPageNum()))
                    .collect(Collectors.toList());
        }
    }
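
The submit-then-collect pattern used in the multi-threaded example can be tried in isolation. The following is a minimal sketch with hypothetical names: `parsePage` is a stand-in for `reader.parsePdfTablePage(pdfDoc, pageNum)` and returns a string instead of a `ParsedTablePage`. Because the futures are read in the order they were submitted, the results come back in page order even if the tasks finish out of order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class OrderedPageTasksSketch {

    // Hypothetical stand-in for the real per-page parsing call.
    static String parsePage(int pageNum) {
        return "page-" + pageNum;
    }

    static List<String> parseAll(int pageCount, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            // submit one task per page; pages are 1-indexed
            List<Future<String>> futures = new ArrayList<>();
            for (int pageNum = 1; pageNum <= pageCount; pageNum++) {
                final int n = pageNum;
                Callable<String> task = () -> parsePage(n);
                futures.add(executor.submit(task));
            }
            // read futures in submission order, which is page order
            List<String> results = new ArrayList<>(pageCount);
            for (Future<String> f : futures) {
                results.add(f.get());
            }
            return results;
        } finally {
            // always release the worker threads
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseAll(5, 3));
    }
}
```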

Saving PDF pages as PNG images

PDF-table provides methods for saving PDF pages as PNG images.
The rendering DPI can be changed in PdfTableSettings (see: Parsing settings).
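
To get a feel for how the DPI setting affects image size, the arithmetic can be sketched as follows. This is an illustration with hypothetical names, assuming page dimensions are given in PDF points (72 per inch); the renderer's own rounding may differ by a pixel.

```java
class DpiSketch {

    static final double POINTS_PER_INCH = 72.0;

    // Converts a length in PDF points to pixels at the given DPI.
    static int pointsToPixels(double points, int dpi) {
        return (int) Math.round(points * dpi / POINTS_PER_INCH);
    }

    public static void main(String[] args) {
        int dpi = 160;
        // An A4 page is roughly 595 x 842 points.
        System.out.println(pointsToPixels(595, dpi) + " x " + pointsToPixels(842, dpi));
        // prints "1322 x 1871"
    }
}
```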

single-threaded example

    class SingleThreadPNGDump {
        public static void main(String[] args) throws IOException {
            PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
            Path outputPath = Paths.get("C:", "some_directory");
            PdfTableReader reader = new PdfTableReader();
            reader.savePdfPagesAsPNG(pdfDoc, 1, pdfDoc.getNumberOfPages(), outputPath);
        }
    }

multi-threaded example

    class MultiThreadPNGDump {
        public static void main(String[] args) throws IOException {
            final int THREAD_COUNT = 8;
            Path outputPath = Paths.get("C:", "some_directory");
            PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
            PdfTableReader reader = new PdfTableReader();

            // save pages simultaneously
            ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
            List<Future<Boolean>> futures = new ArrayList<>();
            for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
                Callable<Boolean> callable = () -> {
                    reader.savePdfPageAsPNG(pdfDoc, pageNum, outputPath);
                    return true;
                };
                futures.add(executor.submit(callable));
            }

            // wait for all pages to be saved
            try {
                for (Future<Boolean> f : futures) {
                    f.get();
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            } finally {
                // release the worker threads so the JVM can exit
                executor.shutdown();
            }
        }
    }

Saving debug PNG images

When tables in a PDF document cannot be parsed correctly with the default settings, you can save debug images that show the page at various stages of processing.
Using these images, you can adjust PdfTableSettings to achieve the desired results (see: Parsing settings).

single-threaded example

    class SingleThreadDebugImgsDump {
        public static void main(String[] args) throws IOException {
            PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
            Path outputPath = Paths.get("C:", "some_directory");
            PdfTableReader reader = new PdfTableReader();
            reader.savePdfTablePagesDebugImages(pdfDoc, 1, pdfDoc.getNumberOfPages(), outputPath);
        }
    }

multi-threaded example

    class MultiThreadDebugImgsDump {
        public static void main(String[] args) throws IOException {
            final int THREAD_COUNT = 8;
            Path outputPath = Paths.get("C:", "some_directory");
            PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
            PdfTableReader reader = new PdfTableReader();

            // save debug images simultaneously
            ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
            List<Future<Boolean>> futures = new ArrayList<>();
            for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
                Callable<Boolean> callable = () -> {
                    reader.savePdfTablePagesDebugImage(pdfDoc, pageNum, outputPath);
                    return true;
                };
                futures.add(executor.submit(callable));
            }

            // wait for all debug images to be saved
            try {
                for (Future<Boolean> f : futures) {
                    f.get();
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            } finally {
                // release the worker threads so the JVM can exit
                executor.shutdown();
            }
        }
    }

Parsing settings

PDF rendering and OpenCV filtering settings are stored in a PdfTableSettings object.

A custom settings instance can be passed to the PdfTableReader constructor when non-default values are needed:

    (...)

    // build settings object
    PdfTableSettings settings = PdfTableSettings.getBuilder()
            .setCannyFiltering(true)
            .setCannyApertureSize(5)
            .setCannyThreshold1(40)
            .setCannyThreshold2(190.5)
            .setPdfRenderingDpi(160)
            .build();

    // pass settings to reader
    PdfTableReader reader = new PdfTableReader(settings);

Output format

Each parsed PDF page is returned as a ParsedTablePage object:

    (...)

    PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
    PdfTableReader reader = new PdfTableReader();

    // first page in document has index == 1, not 0!
    ParsedTablePage firstPage = reader.parsePdfTablePage(pdfDoc, 1);

    // getting page number
    assert firstPage.getPageNum() == 1;

    // rows and cells are zero-indexed, just like elements of a List
    // getting first row
    ParsedTablePage.ParsedTableRow firstRow = firstPage.getRow(0);

    // getting third cell in second row
    String thirdCellContent = firstPage.getRow(1).getCell(2);

    // cell content usually contains <CR><LF> characters,
    // so it is recommended to trim it before processing
    double thirdCellNumericValue = Double.parseDouble(thirdCellContent.trim());
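
Since cell content arrives as raw strings, a small helper can make numeric conversion safer when a cell is empty or not a number. The following is a minimal sketch; the class and method names are hypothetical and not part of pdf-table.

```java
import java.util.OptionalDouble;

class CellValueSketch {

    // Trims line-break characters and parses the cell as a double.
    // Returns an empty result for null, blank, or non-numeric cells.
    static OptionalDouble parseNumericCell(String rawCell) {
        if (rawCell == null) {
            return OptionalDouble.empty();
        }
        String trimmed = rawCell.trim();
        if (trimmed.isEmpty()) {
            return OptionalDouble.empty();
        }
        try {
            return OptionalDouble.of(Double.parseDouble(trimmed));
        } catch (NumberFormatException e) {
            return OptionalDouble.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseNumericCell("12.5\r\n").isPresent()); // prints "true"
        System.out.println(parseNumericCell("n/a\r\n").isPresent());  // prints "false"
    }
}
```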
