RapidAI/RapidDoc


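📝 Extracts content from document images and reproduces each document image one-to-one in Word or TXT output for further use or processing. Planned: PDF/image input, with output in JSON, TXT, Word, and Markdown formats.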

License

Apache-2.0 license

Folders and files (16 commits)

rapid_doc
scripts
tests
.gitignore
.pre-commit-config.yaml
LICENSE
README.md
demo.py
requirements.txt
test_pdf_extract.py


📃 Rapid Doc

🚀 Work In Progress
The overall functionality is not finished yet. Contributors are welcome to join and help build it.

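📝 Introduction

This project extracts content from document images and reproduces each document image one-to-one in Word or TXT output for further use or processing. Support for PDF/image input, with output in JSON, TXT, Word, and Markdown formats, is planned.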

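🛠️ Overall Architecture

The packages below make up the overall framework. All of them are produced by RapidAI.

flowchart TD
    A[/Document image/] --> B([Document orientation classification rapid_orientation]) --> C([Layout analysis rapid_layout])
    C --> D([Table recognition rapid_table]) & E([Formula recognition rapid_latex_ocr]) & F([Text recognition rapidocr_onnxruntime]) --> G([Layout recovery rapid_layout_recover])
    G --> H[/Structured output/]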
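The data flow in the diagram can be sketched in plain Python. This is only an illustration of how the stages chain together, not the project's actual code: `Region`, `run_pipeline`, `recover_top_to_bottom`, and every callable passed in are hypothetical placeholders, and none of them reflect the real APIs of the RapidAI packages named above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


@dataclass
class Region:
    """One layout region (hypothetical structure, for illustration only)."""
    kind: str          # "text", "table", or "formula"
    box: Box
    content: str = ""


def run_pipeline(
    image: Any,
    fix_orientation: Callable[[Any], Any],
    detect_layout: Callable[[Any], List[Region]],
    recognizers: Dict[str, Callable[[Any, Box], str]],
    recover_layout: Callable[[List[Region]], str],
) -> str:
    """Chain the stages in the order shown in the flowchart."""
    upright = fix_orientation(image)        # orientation classification
    regions = detect_layout(upright)        # layout analysis
    for region in regions:
        # Dispatch each region to the recognizer for its kind
        # (table, formula, or text recognition).
        region.content = recognizers[region.kind](upright, region.box)
    return recover_layout(regions)          # layout recovery


def recover_top_to_bottom(regions: List[Region]) -> str:
    """Naive stand-in for layout recovery: order regions by top edge, then left edge."""
    ordered = sorted(regions, key=lambda r: (r.box[1], r.box[0]))
    return "\n".join(r.content for r in ordered)
```

Passing the stages in as callables keeps the sketch independent of any particular model package; a real integration would wrap each RapidAI package behind one of these callables.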

📑 Input and Output

Input: document images
Output: TXT or Word

💻 Install the Runtime Environment

pip install -r requirements.txt

🚀 Run the Demo

git clone https://github.com/RapidAI/RapidDoc.git
cd RapidDoc
python demo.py

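📈 Example Results

⚠️ Note: The extracted results are not split into paragraphs because the layout analysis model has no paragraph detection capability. None of the currently available open-source layout analysis models provide paragraph detection. Training a custom layout analysis model to improve this is under consideration.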
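To make the missing step concrete, here is a minimal sketch of what a paragraph grouping pass could look like if it were done with a simple geometric heuristic instead of a trained model. This is not the project's approach: `group_into_paragraphs`, the `(text, top_y, bottom_y)` line format, and the `gap_ratio` threshold are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical line format: (text, top_y, bottom_y) in pixels.
Line = Tuple[str, float, float]


def group_into_paragraphs(lines: List[Line], gap_ratio: float = 0.6) -> List[str]:
    """Start a new paragraph when the vertical gap to the previous line
    exceeds gap_ratio times the previous line's height."""
    paragraphs: List[str] = []
    current: List[str] = []
    prev_bottom = None
    prev_height = None
    # Process lines from top to bottom.
    for text, top, bottom in sorted(lines, key=lambda item: item[1]):
        if prev_bottom is not None:
            gap = top - prev_bottom
            if gap > gap_ratio * prev_height:
                paragraphs.append(" ".join(current))
                current = []
        current.append(text)
        prev_bottom, prev_height = bottom, bottom - top
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

A heuristic like this only works for single-column text with consistent line spacing, and it joins lines with a space, which suits English but not Chinese text. Those limits are one reason a trained layout model is the more robust option.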

Releases: 1 (latest: assets, Sep 10, 2024)

Languages: Python 100.0%
