io: introduce URI-based IO layer with optional s3 backend #4040


Open: eric-ozim wants to merge 3 commits into opendatalab:master from ozim-ai:feature/parse_file_enhancement

@eric-ozim

PR Description
Feature: URI-based I/O helpers with optional S3 backend
This PR introduces a URI-based I/O utility layer to support local and S3 storage in a reusable way, and makes the S3 backend an optional, lazily imported dependency. This is a pure infrastructure enhancement and does not change any FastAPI endpoints.
What Changed
New Files
- mineru/data/utils/uri_io.py (core URI-based helpers):
  - read_bytes_from_uri (local path + s3:// / s3a://)
  - prepare_output_dir
  - upload_parse_dir_to_s3
  - cleanup_temp_dir
- tests/test_uri_io.py (unit tests for local/S3 URI handling and output directory selection)
Modified Files
- pyproject.toml
  - Moved boto3 from core dependencies to an optional extra: mineru[s3]
  - Ensured mineru[core] pulls in mineru[s3] so the “full install” still has S3 support
- mineru/data/io/s3.py
  - Wrapped the boto3 import in a guarded lazy import with a clear error message: if S3 is used without installing mineru[s3], users get a precise ImportError telling them how to enable it
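The guarded lazy import described above can be sketched roughly as follows. This is a minimal illustration, not the code in mineru/data/io/s3.py: the helper name `require_optional` and the exact error wording are assumptions made for the example.

```python
import importlib


def require_optional(module_name: str, extra: str):
    """Import an optional dependency on first use (hypothetical helper).

    If the module is missing, re-raise ImportError with an install hint
    that names the extra, e.g. pip install "mineru[s3]".
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this feature. "
            f'Install it with: pip install "mineru[{extra}]"'
        ) from exc
```

Under this sketch, an S3 code path would call `require_optional("boto3", "s3")` at the point of use, so importing the package itself never fails when boto3 is absent.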
URI I/O Logic
How It Works
URI-based reading (read_bytes_from_uri)
- Local paths (no ://): use existing read_fn(Path(...)) to handle PDF/images as before
- S3 URIs (s3:// / s3a://):
  - Validate S3 backend availability (_require_s3_backend)
  - Read bytes via S3Reader using env-based config (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, S3_ENDPOINT_URL, optional S3_ADDRESSING_STYLE)
- Other schemes (http://, https://, file://, etc.):
  - Explicitly rejected with a ValueError("Unsupported URI scheme ... Only local paths and s3:// are supported.")
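The scheme dispatch above might look roughly like the sketch below. The names `classify_uri` and `read_bytes_sketch` are hypothetical, the two reader callables stand in for read_fn and S3Reader, and the middle of the error message (elided as "..." in the description) is filled with the scheme name purely as an assumption.

```python
from pathlib import Path

S3_SCHEMES = ("s3://", "s3a://")


def classify_uri(uri: str) -> str:
    """Return "s3" or "local"; raise ValueError for any other scheme."""
    if uri.startswith(S3_SCHEMES):
        return "s3"
    if "://" in uri:
        scheme = uri.split("://", 1)[0]
        # Assumed wording; the real message text is elided in the PR.
        raise ValueError(
            f"Unsupported URI scheme {scheme!r}. "
            "Only local paths and s3:// are supported."
        )
    return "local"


def read_bytes_sketch(uri, read_local, read_s3):
    """Dispatch to a local or S3 reader (both are injected stand-ins)."""
    if classify_uri(uri) == "s3":
        return read_s3(uri)
    return read_local(Path(uri))
```

Injecting the readers keeps the sketch runnable without boto3 or the mineru package; the actual helper presumably calls its readers directly.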
Output directory selection (prepare_output_dir)
- S3 output (output_uri starts with s3:// / s3a://):
  - Allocate a temporary local directory with tempfile.mkdtemp
  - Return (temp_dir, is_s3_output=True, normalized_output_uri=output_uri)
- Local output / no URI:
  - Ensure fallback_local_dir exists
  - Return (fallback_local_dir, is_s3_output=False, normalized_output_uri)
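A minimal sketch of this selection logic, assuming the three-element return shape listed above. The function name, the `parse_` temp-dir prefix, and passing output_uri through unchanged in the local case are assumptions; the PR does not spell out how the URI is normalized for local output.

```python
import os
import tempfile
from typing import Optional, Tuple


def prepare_output_dir_sketch(
    output_uri: Optional[str], fallback_local_dir: str
) -> Tuple[str, bool, Optional[str]]:
    """Return (work_dir, is_s3_output, normalized_output_uri)."""
    if output_uri and output_uri.startswith(("s3://", "s3a://")):
        # S3 output: write locally first, upload afterwards.
        temp_dir = tempfile.mkdtemp(prefix="parse_")  # prefix is illustrative
        return temp_dir, True, output_uri
    # Local output or no URI: make sure the fallback directory exists.
    os.makedirs(fallback_local_dir, exist_ok=True)
    return fallback_local_dir, False, output_uri
```

In the S3 case the caller owns the temporary directory and is responsible for removing it once the upload has finished.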
Uploading parse results to S3 (upload_parse_dir_to_s3)
- Requires S3 backend (mineru[s3] installed)
- Uses S3DataWriter to mirror the full subtree under local_parse_dir into the target S3 prefix
- Returns the final S3 “parse_dir” URI (s3://bucket/prefix/) for use in API responses
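The mirroring step can be illustrated by the sketch below, which maps every file under a local directory to a key under the target prefix. It does not use S3DataWriter: the `put` callable is a hypothetical stand-in for the writer, and returning an s3:// URI even for s3a:// input is an assumption.

```python
from pathlib import Path
from typing import Callable, Tuple


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split s3://bucket/prefix into (bucket, prefix without edge slashes)."""
    without_scheme = uri.split("://", 1)[1]
    bucket, _, prefix = without_scheme.partition("/")
    return bucket, prefix.strip("/")


def upload_dir_sketch(
    local_dir: str, output_uri: str, put: Callable[[str, str, bytes], None]
) -> str:
    """Mirror the subtree under local_dir into the prefix of output_uri."""
    bucket, prefix = split_s3_uri(output_uri)
    root = Path(local_dir)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Keys always use forward slashes, regardless of the local OS.
            rel = path.relative_to(root).as_posix()
            key = f"{prefix}/{rel}" if prefix else rel
            put(bucket, key, path.read_bytes())
    return f"s3://{bucket}/{prefix}/" if prefix else f"s3://{bucket}/"
```

Passing the writer in as a callable keeps the example testable without network access or credentials.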
Temp directory cleanup (cleanup_temp_dir)
- Best-effort shutil.rmtree with warning logs on failure
- Designed to be safe for use in FastAPI BackgroundTask
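A best-effort cleanup along these lines could be sketched as below. The function name and log message are illustrative assumptions; the point is only that no exception escapes, so a failure cannot break a background task.

```python
import logging
import shutil

logger = logging.getLogger(__name__)


def cleanup_temp_dir_sketch(path: str) -> None:
    """Remove a temp directory; log a warning instead of raising."""
    try:
        shutil.rmtree(path)
    except Exception as exc:  # deliberately broad: cleanup must never raise
        logger.warning("Failed to remove temp dir %s: %s", path, exc)
```

With FastAPI, a function like this would typically be scheduled through `BackgroundTasks.add_task` so that removal runs after the response has been sent.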
Key Components
- mineru.data.utils.uri_io.read_bytes_from_uri:
  - Normalizes input handling for local paths and S3 URIs
  - Provides clear error messages for unsupported schemes and missing S3 backend
- mineru.data.utils.uri_io.prepare_output_dir:
  - Centralizes the decision “write to a real local directory vs. a temp dir for S3 upload”
- mineru.data.utils.uri_io.upload_parse_dir_to_s3:
  - Generic “upload a whole parse directory to S3” helper, independent of the API layer
- Optional S3 backend (mineru[s3]):
  - pyproject.toml defines the s3 extra with boto3
  - mineru/data/io/s3.py guards imports and gives a direct hint: pip install "mineru[s3]"

Checklist
Code Quality
[x] Code follows existing project style and layout conventions
[x] Self-review performed for uri_io.py, s3.py, and pyproject.toml changes
[x] New helpers and edge cases are documented in code comments
[x] Changes introduce no new linter errors in touched files
Testing
[x] pytest tests/test_uri_io.py passes locally
[x] Verified local-path reading works against tests/unittest/pdfs/test.pdf
[x] Verified unsupported schemes (http://...) raise clear ValueError
[x] Verified S3 access without backend raises clear ImportError pointing to mineru[s3]
[x] Verified prepare_output_dir behavior for both local and S3 output modes
Documentation
[ ] (To be done in a follow-up PR) Public docs update for the new URI-based behavior when integrating into /file_parse
[x] Internal behavior and expectations are documented in uri_io.py docstrings and tests
🧪 Testing Guide
Basic Tests:
1. Ensure mineru is installed in editable mode with test extras:
   pip install -e ".[test]"
2. Run the new tests:
   pytest tests/test_uri_io.py
Scenarios Covered:
- Local read:
  - read_bytes_from_uri("tests/unittest/pdfs/test.pdf") returns non-empty bytes
- Unsupported scheme:
  - read_bytes_from_uri("http://example.com/foo.pdf") raises ValueError with “Unsupported URI scheme” and “Only local paths and s3://”
- Missing S3 backend:
  - With the S3 backend disabled/missing, calling read_bytes_from_uri("s3://bucket/key") raises ImportError mentioning pip install "mineru[s3]"
- Output dir selection:
  - prepare_output_dir(None, "./output") returns a real local directory
  - prepare_output_dir("s3://my-bucket/prefix", "./output") returns a temp dir with is_s3_output=True and normalized_output_uri unchanged

dosubot (bot) added the size:L label (This PR changes 100-499 lines, ignoring generated files.) on Nov 21, 2025
github-actions (bot, Contributor) commented on Nov 21, 2025 (edited):

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

dosubot (bot) added the enhancement label (New feature or request) on Nov 21, 2025
@eric-ozim (Author) commented:

I have read the CLA Document and I hereby sign the CLA

eric-ozim force-pushed the feature/parse_file_enhancement branch from a3020a0 to 67d2dcb on November 21, 2025 11:59
github-actions (bot) added a commit that referenced this pull request on Nov 21, 2025