skitsanos/gemini-ocr

PDF Screenshot OCR Analysis with Google Gemini Pro

www.linkedin.com/pulse/pdf-screenshot-ocr-analysis-google-gemini-pro-evgenios-skitsanos-htapf

Folders and files

data
.gitignore
README.md
ocr.sh
prompt.txt
utils.sh

PDF Screenshot OCR Analysis with Google Gemini Pro

This project automates the conversion of PDF document screenshots into text using Google's Gemini Pro model. It performs Optical Character Recognition (OCR) on screenshots taken from PDF documents in order to extract and analyze their textual content.

Workflow Overview:

Screenshot Extraction: Images are taken from PDF documents and stored in a designated directory (data/).
Prompt Preparation: A text prompt is read from prompt.txt, which instructs the model on how to process the images.
Image Processing:
- The script determines the MIME type for each image in the data/ directory and encodes it in Base64.
- These encoded images and the initial user prompt are incorporated into a JSON structure.
Generation Configuration: A generation configuration is created to fine-tune the model's processing parameters, such as topP and temperature.
Payload Preparation: The JSON structure, including the images and configuration, is prepared as a payload for the API request.
API Request: The payload is sent to the Google Gemini Pro model via an API endpoint to perform OCR.
Response Handling:
- The response, containing the extracted text and metadata, is saved to response.json.
- The textual content is extracted and saved to response.txt.
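The image-processing, generation-configuration and payload steps above can be sketched roughly as follows. This is a minimal illustration, not the code from ocr.sh or utils.sh. The function names (get_mime_type, encode_base64, build_payload), the TEMPERATURE and TOP_P variables and their default values are all hypothetical. The MIME type is guessed from the file extension here, which may differ from how the actual script detects it. The JSON field names follow the public Gemini generateContent request format.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names and defaults are illustrative assumptions.

# Guess a MIME type from the file extension (assumption: extension-based).
get_mime_type() {
  local ext
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    png)      echo "image/png" ;;
    jpg|jpeg) echo "image/jpeg" ;;
    webp)     echo "image/webp" ;;
    *)        echo "application/octet-stream" ;;
  esac
}

# Base64-encode a file as a single line (works with GNU and BSD base64).
encode_base64() {
  base64 < "$1" | tr -d '\n'
}

# Build the request payload: the prompt as a text part, followed by one
# inline_data part per image, plus a generationConfig object.
# Usage: build_payload prompt.txt data/*.png > payload.json
build_payload() {
  local prompt_file="$1"; shift
  local img
  for img in "$@"; do
    # Stream the Base64 data through stdin so large images do not hit
    # command-line argument length limits.
    encode_base64 "$img" |
      jq -Rs --arg mime "$(get_mime_type "$img")" \
        '{inline_data: {mime_type: $mime, data: .}}'
  done |
    jq -s --arg prompt "$(cat "$prompt_file")" \
      --argjson temperature "${TEMPERATURE:-0.4}" \
      --argjson top_p "${TOP_P:-1}" \
      '{contents: [{role: "user", parts: ([{text: $prompt}] + .)}],
        generationConfig: {temperature: $temperature, topP: $top_p}}'
}
```

Piping each image part into a final jq -s call keeps the Base64 data out of shell variables and command-line arguments, which matters for full-page screenshots that can run to several megabytes once encoded.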

Key Components:

Image Processing: Functions to get MIME type and encode images in Base64.
JSON Structuring: Using jq to build and modify JSON payloads.
API Integration: Sending the payload to Google Gemini Pro and handling the response.
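As a rough sketch of the API integration component, the functions below send a prepared payload file to the model and pull the text out of the saved response. The function names (send_request, extract_text) and the GEMINI_MODEL and GEMINI_API_KEY environment variables are hypothetical, not taken from the repository's scripts. The endpoint and response shape follow the public Gemini generateContent REST API, and the model identifier is left to the caller because this README does not state which one is used.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names and environment variables are assumptions.

# POST a JSON payload file to the generateContent endpoint.
# Requires GEMINI_MODEL and GEMINI_API_KEY to be set by the caller.
send_request() {
  local payload_file="$1"
  local base="https://generativelanguage.googleapis.com/v1beta/models"
  local url="${base}/${GEMINI_MODEL:?set GEMINI_MODEL}:generateContent"
  curl -sS -X POST "$url" \
    -H "Content-Type: application/json" \
    -H "x-goog-api-key: ${GEMINI_API_KEY:?set GEMINI_API_KEY}" \
    --data-binary "@${payload_file}"
}

# Concatenate every text part from every candidate in a saved response.
# Prints an empty line if the response has no candidates (e.g. an error).
extract_text() {
  jq -r '[.candidates[]?.content.parts[]?.text // empty] | join("")' "$1"
}

# Example flow, assuming payload.json already exists:
#   send_request payload.json > response.json
#   extract_text response.json > response.txt
```

Saving the raw response before extracting text keeps the metadata available for debugging when the extracted text turns out to be empty.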

Benefits:

Automation: Streamlines the process of converting PDF screenshots to text.
Accuracy: Leverages the Gemini Pro model's image understanding for high-quality text extraction.
Flexibility: Configurable processing parameters to optimize OCR results.

This project is ideal for scenarios where automated text extraction from PDF screenshots is needed, such as digitizing documents, extracting data for analysis, or improving accessibility.

Languages

Shell 100.0%