Posted on Dec 19, 2024 • Edited on Jan 18

Multiple document conversion using Docling and a GUI

Introduction

In a previous post I described my very first steps with Docling (My first hands-on experience with Docling). In that first attempt I used the Tkinter framework to add a GUI so I could choose a file and convert it to various formats using Docling.

Natively, Docling provides batch conversion from the command line, as in the following examples:

    # Convert a single file to Markdown (default)
    docling myfile.pdf

    # Convert a single file to Markdown and JSON, without OCR
    docling myfile.pdf --to json --to md --no-ocr

    # Convert PDF files in input directory to Markdown (default)
    docling ./input/dir --from pdf

    # Convert PDF and Word files in input directory to Markdown and JSON
    docling ./input/dir --from pdf --from docx --to md --to json --output ./scratch

    # Convert all supported files in input directory to Markdown, but abort on first error
    docling ./input/dir --output ./scratch --abort-on-error
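When the options come from a script or a GUI rather than from the keyboard, the same commands can be assembled in Python. The sketch below is only an illustration: `build_docling_command` and `run_docling` are hypothetical helpers, they use only the flags shown in the examples above (`--from`, `--to`, `--output`, `--no-ocr`, `--abort-on-error`), and they assume the `docling` executable is installed and on the PATH.

```python
import subprocess


def build_docling_command(source, from_formats=(), to_formats=(),
                          output=None, ocr=True, abort_on_error=False):
    """Build the argument list for the docling CLI (hypothetical helper)."""
    cmd = ["docling", str(source)]
    # Each input/output format is passed as its own repeated flag.
    for fmt in from_formats:
        cmd += ["--from", fmt]
    for fmt in to_formats:
        cmd += ["--to", fmt]
    if output is not None:
        cmd += ["--output", str(output)]
    if not ocr:
        cmd.append("--no-ocr")
    if abort_on_error:
        cmd.append("--abort-on-error")
    return cmd


def run_docling(cmd):
    """Run the command; assumes the docling CLI is available on the PATH."""
    return subprocess.run(cmd, check=True)
```

For instance, `build_docling_command("./input/dir", from_formats=["pdf", "docx"], to_formats=["md", "json"], output="./scratch")` produces the argument list for the fourth command above, which can then be handed to `run_docling`.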

I wanted to do almost the same thing with a GUI. So this is my second attempt at building my own tooling with Docling, this time to convert multiple documents of different types.

And of course, I am still using Tkinter 🤭

Implementation

I started from my previous code and changed the file selection dialog box so that it accepts multiple files.

    # open-file dialog
    root = tk.Tk()
    filenames = tk.filedialog.askopenfilenames(
        title='Select files (pdf, pptx, docx, md, img)..',
        filetypes=filetypes,
    )

    if filenames:
        print("Selected files:")
        for filename in filenames:
            print(filename)
    else:
        print("No files selected.")
        quit()
    root.destroy()
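`askopenfilenames` returns a tuple of paths, and depending on the platform the user may still be able to pick a file the converter will not accept. As a minimal sketch, the selection could be checked before conversion starts. `SUPPORTED_EXTENSIONS` and `split_supported` are hypothetical names introduced for illustration; the extension set mirrors the file types offered in this post's dialog and is an assumption, not Docling's authoritative list of formats.

```python
from pathlib import Path

# Assumption: mirrors the extensions offered in the file dialog of this post.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".html", ".md",
    ".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff",
}


def split_supported(filenames):
    """Split paths into (supported, skipped) by case-insensitive extension."""
    supported, skipped = [], []
    for name in filenames:
        if Path(name).suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(name)
        else:
            skipped.append(name)
    return supported, skipped
```

This check is optional and is not part of the script in this post; it only shows where such a filter could sit between the dialog and the conversion loop.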

With the multi-file dialog in place, the complete script becomes the following.

    import json
    import logging
    import time
    from pathlib import Path

    # GUI for file selection with Tkinter
    import tkinter as tk
    from tkinter import filedialog

    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import (
        DocumentConverter,
        PdfFormatOption,
        WordFormatOption,
    )

    # Only needed for the commented-out format options further down
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.pipeline.simple_pipeline import SimplePipeline
    from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline

    # File types offered in the Tkinter dialog box
    filetypes = (
        ('PDF files', '*.PDF'),
        ('Word file', '*.DOCX'),
        ('Powerpoint file', '*.PPTX'),
        ('HTML file', '*.HTML'),
        ('IMAGE file', '*.PNG'),
        ('IMAGE file', '*.JPG'),
        ('IMAGE file', '*.JPEG'),
        ('IMAGE file', '*.GIF'),
        ('IMAGE file', '*.BMP'),
        ('IMAGE file', '*.TIFF'),
        ('MD file', '*.MD'),
    )

    # open-file dialog
    root = tk.Tk()
    filenames = tk.filedialog.askopenfilenames(
        title='Select files (pdf, pptx, docx, md, img)..',
        filetypes=filetypes,
    )

    if filenames:
        print("Selected files:")
        for filename in filenames:
            print(filename)
    else:
        print("No files selected.")
        quit()
    root.destroy()

    _log = logging.getLogger(__name__)


    def main():
        # Configure logging first so the timing messages below are shown
        logging.basicConfig(level=logging.INFO)

        # Docling Parse with EasyOCR
        # ----------------------
        pipeline_options = PdfPipelineOptions()
        pipeline_options.do_ocr = True
        pipeline_options.do_table_structure = True
        pipeline_options.table_structure_options.do_cell_matching = True

        doc_converter = DocumentConverter(
            # all of the below is optional, has internal defaults.
            allowed_formats=[
                InputFormat.PDF,
                InputFormat.IMAGE,
                InputFormat.DOCX,
                InputFormat.HTML,
                InputFormat.PPTX,
                InputFormat.ASCIIDOC,
                InputFormat.MD,
            ],  # whitelist formats, non-matching files are ignored.
            format_options={
                # InputFormat.PDF: PdfFormatOption(
                #     pipeline_cls=StandardPdfPipeline, backend=PyPdfiumDocumentBackend
                # ),
                # InputFormat.DOCX: WordFormatOption(
                #     pipeline_cls=SimplePipeline  # , backend=MsWordDocumentBackend
                # ),
                InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
            },
        )

        # All results are written to the same output directory
        output_dir = Path("scratch")
        output_dir.mkdir(parents=True, exist_ok=True)

        # Convert each selected file in turn
        for filename in filenames:
            input_doc_path = filename

            start_time = time.time()
            conv_result = doc_converter.convert(input_doc_path)
            end_time = time.time() - start_time
            _log.info(f"Document converted in {end_time:.2f} seconds.")

            ## Export results
            doc_filename = conv_result.input.file.stem

            # Export Deep Search document JSON format:
            with (output_dir / f"{doc_filename}.json").open("w", encoding="utf-8") as fp:
                fp.write(json.dumps(conv_result.document.export_to_dict()))

            # Export Text format:
            with (output_dir / f"{doc_filename}.txt").open("w", encoding="utf-8") as fp:
                fp.write(conv_result.document.export_to_text())

            # Export Markdown format:
            with (output_dir / f"{doc_filename}.md").open("w", encoding="utf-8") as fp:
                fp.write(conv_result.document.export_to_markdown())

            # Export Document Tags format:
            with (output_dir / f"{doc_filename}.doctags").open("w", encoding="utf-8") as fp:
                fp.write(conv_result.document.export_to_document_tokens())


    if __name__ == "__main__":
        main()
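One thing the loop in this script does not handle is a file that fails to convert: an exception on one document stops the whole batch. The sketch below shows one possible way to keep going and collect the failures, similar in spirit to the CLI's `--abort-on-error` switch. `convert_batch` is a hypothetical helper, and its `convert_one` argument stands for any callable that converts a single path, for example `doc_converter.convert`.

```python
import logging
import time

_log = logging.getLogger(__name__)


def convert_batch(filenames, convert_one, abort_on_error=False):
    """Convert each file with convert_one; return (results, failures).

    results maps path -> conversion result, failures maps path -> exception.
    """
    results, failures = {}, {}
    for filename in filenames:
        start_time = time.time()
        try:
            results[filename] = convert_one(filename)
        except Exception as exc:  # broad on purpose: any failure is recorded
            if abort_on_error:
                raise
            failures[filename] = exc
            _log.warning("Skipping %s: %s", filename, exc)
            continue
        elapsed = time.time() - start_time
        _log.info("%s converted in %.2f seconds.", filename, elapsed)
    return results, failures
```

With this in place, the export step would loop over `results` only, and `failures` could be printed at the end so the user knows which files need another look.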

The GUI:

The terminal output:

    2024-12-19 13:22:50.682 Python[62628:4807197] +[IMKClient subclass]: chose IMKClient_Modern
    2024-12-19 13:22:51.214 Python[62628:4807197] The class 'NSOpenPanel' overrides the method identifier.  This method is implemented by class 'NSWindow'
    Selected files:
    /Users/alainairom/Docling_test/file-selection.png
    /Users/alainairom/Docling_test/file-selection2.png
    /Users/alainairom/Docling_test/file-selection3.png
    /Users/alainairom/Docling_test/mobicheckin_server_event_guest_category_66968af1fc394000725041be_badge_template_66a37cdae369f921e572e4fa_1732039981_NZYRUCY.pdf
    /Users/alainairom/Docling_test/scratch.png
    /Users/alainairom/Docling_test/Screenshot at Dec 02 08-11-28.png

And the actual converted docs…
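Because every output file is named after the stem of its input, two selected files such as `report.pdf` and `report.docx` would both write to `scratch/report.md`, and the second would overwrite the first. A minimal sketch of one way to avoid that is shown below; `unique_stem` is a hypothetical helper and the naming scheme is just one possible choice.

```python
from pathlib import Path


def unique_stem(input_path, used_stems):
    """Return an output stem for input_path that is not yet in used_stems.

    The first file keeps its plain stem; a later file with the same stem
    gets its extension appended (report.docx -> report_docx), then a counter.
    The chosen stem is added to used_stems before returning.
    """
    path = Path(input_path)
    candidate = path.stem
    if candidate in used_stems:
        candidate = f"{path.stem}_{path.suffix.lstrip('.').lower()}"
    base = candidate
    counter = 2
    while candidate in used_stems:
        candidate = f"{base}_{counter}"
        counter += 1
    used_stems.add(candidate)
    return candidate
```

In the conversion loop, `doc_filename` could then be taken from `unique_stem(filename, used_stems)` with a single `used_stems = set()` created before the loop.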