io: introduce URI-based IO layer with optional s3 backend #4040
Open
eric-ozim wants to merge 3 commits into opendatalab:master from ozim-ai:feature/parse_file_enhancement
+212 −3
Contributor
github-actions bot commented Nov 21, 2025 • edited
All contributors have signed the CLA ✍️ ✅
Author
eric-ozim commented Nov 21, 2025
I have read the CLA Document and I hereby sign the CLA
a3020a0 to 67d2dcb
PR Description
Feature: URI-based I/O helpers with optional S3 backend
This PR introduces a URI-based I/O utility layer to support local and S3 storage in a reusable way, and makes the S3 backend an optional, lazily imported dependency. This is a pure infrastructure enhancement and does not change any FastAPI endpoints.
What Changed
New Files
mineru/data/utils/uri_io.py – Core URI-based helpers:
read_bytes_from_uri (local path + s3:// / s3a://)
prepare_output_dir
upload_parse_dir_to_s3
cleanup_temp_dir
tests/test_uri_io.py – Unit tests for local/S3 URI handling and output directory selection
Modified Files
pyproject.toml
Moved boto3 from core dependencies to an optional extra: mineru[s3]
Ensured mineru[core] pulls in mineru[s3] so the “full install” still has S3 support
mineru/data/io/s3.py
Wrapped boto3 import in a guarded lazy import with a clear error message:
If S3 is used without installing mineru[s3], users get a precise ImportError telling them how to enable it
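The guarded lazy import described above could look roughly like the following. This is a minimal sketch and not the code in mineru/data/io/s3.py: the helper name `require_optional` and the exact message wording are assumptions made for illustration.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency on first use (hypothetical helper).

    If the module is missing, re-raise as an ImportError that names the
    extra the user has to install.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature but is not installed. "
            f'Install it with: pip install "mineru[{extra}]"'
        ) from exc


def get_boto3() -> ModuleType:
    # Called only from code paths that actually touch S3, so importing
    # the package never requires boto3 to be present.
    return require_optional("boto3", "s3")
```

Deferring the import to the call site keeps `import mineru` working on installs without the s3 extra, and the error surfaces only when an S3 URI is actually used.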
URI I/O Logic
How It Works
URI-based reading (read_bytes_from_uri)
Local paths (no ://): use existing read_fn(Path(...)) to handle PDF/images as before
S3 URIs (s3:// / s3a://):
Validate S3 backend availability (_require_s3_backend)
Read bytes via S3Reader using env-based config (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, S3_ENDPOINT_URL, optional S3_ADDRESSING_STYLE)
Other schemes (http://, https://, file://, etc.):
Explicitly rejected with a ValueError("Unsupported URI scheme ... Only local paths and s3:// are supported.")
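The scheme dispatch above can be sketched as follows. The names `classify_uri` and `read_bytes_from_uri_sketch`, the injected `read_local` / `read_s3` callables, and the middle part of the error message are assumptions for illustration; the PR's actual implementation uses read_fn and S3Reader internally.

```python
from pathlib import Path
from typing import Callable

S3_SCHEMES = ("s3", "s3a")


def classify_uri(uri: str) -> str:
    """Return "local" or "s3"; raise ValueError for any other scheme."""
    if "://" not in uri:
        return "local"
    scheme = uri.split("://", 1)[0].lower()
    if scheme in S3_SCHEMES:
        return "s3"
    # The exact wording between the two fixed phrases is an assumption.
    raise ValueError(
        f"Unsupported URI scheme '{scheme}' in {uri!r}. "
        "Only local paths and s3:// are supported."
    )


def read_bytes_from_uri_sketch(
    uri: str,
    read_local: Callable[[Path], bytes],
    read_s3: Callable[[str], bytes],
) -> bytes:
    """Dispatch to the local or S3 reader based on the URI scheme."""
    if classify_uri(uri) == "local":
        return read_local(Path(uri))
    return read_s3(uri)
```

Passing the readers in as callables is only a convenience for this sketch: it lets the dispatch logic be exercised without a real file or S3 endpoint.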
Output directory selection (prepare_output_dir)
S3 output (output_uri starts with s3:// / s3a://):
Allocate a temporary local directory with tempfile.mkdtemp
Return (temp_dir, is_s3_output=True, normalized_output_uri=output_uri)
Local output / no URI:
Ensure fallback_local_dir exists
Return (fallback_local_dir, is_s3_output=False, normalized_output_uri)
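A minimal sketch of the selection logic described above, assuming the third tuple element is simply the output URI passed through unchanged (the description does not spell out what normalization happens in the local case). The function name and the temp-dir prefix are hypothetical.

```python
import os
import tempfile
from typing import Optional, Tuple


def prepare_output_dir_sketch(
    output_uri: Optional[str], fallback_local_dir: str
) -> Tuple[str, bool, Optional[str]]:
    """Return (work_dir, is_s3_output, normalized_output_uri)."""
    if output_uri and output_uri.startswith(("s3://", "s3a://")):
        # S3 output: parse into a scratch directory, upload afterwards.
        temp_dir = tempfile.mkdtemp(prefix="mineru_parse_")
        return temp_dir, True, output_uri
    # Local output or no URI: make sure the fallback directory exists.
    os.makedirs(fallback_local_dir, exist_ok=True)
    return fallback_local_dir, False, output_uri
```

The caller owns the temp directory in the S3 case and is responsible for removing it once the upload has finished.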
Uploading parse results to S3 (upload_parse_dir_to_s3)
Requires S3 backend (mineru[s3] installed)
Uses S3DataWriter to mirror the full subtree under local_parse_dir into the target S3 prefix
Returns the final S3 “parse_dir” URI (s3://bucket/prefix/) for use in API responses
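To illustrate what "mirroring the full subtree" means for object keys, the sketch below only computes the local-file to S3-URI mapping and performs no upload. The function name `plan_s3_upload` is hypothetical, and the real helper delegates the actual writes to S3DataWriter.

```python
from pathlib import Path
from typing import Dict, Tuple


def plan_s3_upload(local_parse_dir: str, output_uri: str) -> Tuple[Dict[str, str], str]:
    """Map every file under local_parse_dir to a target URI under output_uri.

    Returns (mapping, parse_dir_uri), where parse_dir_uri ends with "/".
    """
    root = Path(local_parse_dir)
    parse_dir_uri = output_uri.rstrip("/") + "/"
    mapping: Dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Use POSIX separators so keys look the same on every platform.
            rel = path.relative_to(root).as_posix()
            mapping[str(path)] = parse_dir_uri + rel
    return mapping, parse_dir_uri
```

Relative paths are preserved below the prefix, so a file at images/fig.png inside the parse directory ends up under prefix/images/fig.png in the bucket.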
Temp directory cleanup (cleanup_temp_dir)
Best-effort shutil.rmtree with warning logs on failure
Designed to be safe for use in FastAPI BackgroundTask
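A best-effort cleanup along these lines might look like the sketch below. The function name and log message are assumptions; the point is that the function never raises, which is what makes it suitable for running after the response has been sent.

```python
import logging
import shutil

logger = logging.getLogger(__name__)


def cleanup_temp_dir_sketch(path: str) -> None:
    """Remove a temp directory; log a warning instead of raising on failure."""
    try:
        shutil.rmtree(path)
    except Exception as exc:  # deliberately broad: cleanup must not fail the caller
        logger.warning("Failed to remove temp dir %s: %s", path, exc)
```

Because errors are swallowed and logged, a failed cleanup leaves a stray directory behind rather than turning a successful parse into an error.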
Key Components
mineru.data.utils.uri_io.read_bytes_from_uri:
Normalizes input handling for local paths and S3 URIs
Provides clear error messages for unsupported schemes and missing S3 backend
mineru.data.utils.uri_io.prepare_output_dir:
Centralizes the decision between writing to a real local directory and writing to a temp dir for later S3 upload
mineru.data.utils.uri_io.upload_parse_dir_to_s3:
Generic “upload a whole parse directory to S3” helper, independent of API layer
Optional S3 backend (mineru[s3]):
pyproject.toml defines s3 extra with boto3
mineru/data/io/s3.py guards imports and gives a direct hint: pip install "mineru[s3]"
Checklist
Code Quality
[x] Code follows existing project style and layout conventions
[x] Self-review performed for uri_io.py, s3.py, and pyproject.toml changes
[x] New helpers and edge cases are documented in code comments
[x] Changes introduce no new linter errors in touched files
Testing
[x] pytest tests/test_uri_io.py passes locally
[x] Verified local-path reading works against tests/unittest/pdfs/test.pdf
[x] Verified unsupported schemes (http://...) raise clear ValueError
[x] Verified S3 access without backend raises clear ImportError pointing to mineru[s3]
[x] Verified prepare_output_dir behavior for both local and S3 output modes
Documentation
[ ] (To be done in a follow-up PR) Public docs update for the new URI-based behavior once it is integrated into /file_parse
[x] Internal behavior and expectations are documented in uri_io.py docstrings and tests
🧪 Testing Guide
Basic Tests:
Ensure mineru is installed in editable mode with test extras:
pip install -e ".[test]"
Run the new tests:
pytest tests/test_uri_io.py
Scenarios Covered:
Local read:
read_bytes_from_uri("tests/unittest/pdfs/test.pdf") returns non-empty bytes
Unsupported scheme:
read_bytes_from_uri("http://example.com/foo.pdf") raises ValueError with “Unsupported URI scheme” and “Only local paths and s3://”
Missing S3 backend:
With S3 backend disabled/missing, calling read_bytes_from_uri("s3://bucket/key") raises ImportError mentioning pip install "mineru[s3]".
Output dir selection:
prepare_output_dir(None, "./output") returns a real local directory
prepare_output_dir("s3://my-bucket/prefix", "./output") returns a temp dir with is_s3_output=True and normalized_output_uri unchanged
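One way to exercise the "missing S3 backend" scenario on a machine where boto3 is installed is to mask the module in `sys.modules` for the duration of the check. The sketch below is not taken from tests/test_uri_io.py; `load_s3_client_lib` and `missing_backend_message` are hypothetical stand-ins for the guarded import and the test body.

```python
import importlib
import sys
from unittest import mock


def load_s3_client_lib():
    """Hypothetical stand-in for the guarded boto3 import."""
    try:
        return importlib.import_module("boto3")
    except ImportError as exc:
        raise ImportError(
            'S3 support is not installed. Install it with: pip install "mineru[s3]"'
        ) from exc


def missing_backend_message() -> str:
    """Return the ImportError text seen when boto3 is unavailable."""
    # A None entry in sys.modules makes any import of that name fail,
    # whether or not the package is actually installed.
    with mock.patch.dict(sys.modules, {"boto3": None}):
        try:
            load_s3_client_lib()
        except ImportError as exc:
            return str(exc)
    return ""
```

`mock.patch.dict` restores the original `sys.modules` entry on exit, so the masking does not leak into other tests.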