OpenCSGs/csghub-dataflow

OpenCSG dataflow is a one-stop data processing platform designed to leverage large model technology and advanced algorithms to optimize the entire data processing lifecycle, enhancing efficiency and precision, while addressing enterprise challenges in data management such as inefficiency, adaptability gaps, and security and compliance issues.

License

GPL-3.0 license

csghub-dataflow

DataFlow is an open-source platform engineered to streamline end-to-end data processing within the AI/ML lifecycle. By unifying data workflows and model optimization, it transforms fragmented pipelines into a cohesive, automated system—ideal for enterprises tackling data complexity at scale.

🔑 Key Features

Full Lifecycle Management
- Unified handling of data ingestion, transformation, modeling, and evaluation.
Seamless CSGHub Integration
- Directly ingest datasets from CSGHub and push refined data back for model retraining, creating a continuous feedback loop.
Modular & Extensible Design
- Plug-and-play operators for custom pipelines (e.g., NLP, image, audio processing).
Distributed Computing
- Scale workloads across clusters via Kubernetes integration.
Multi-Agent Task Orchestration
- Dynamically allocate complex tasks (e.g., data validation, anomaly detection) to collaborative agents.
MinerU Engine
- Convert PDFs to structured Markdown/JSON for LLM-friendly datasets.
Growing Operator Library
- Expandable support for multimodal data (text, image, video) and domain-specific transformations.

🔗 Acknowledgements

This project is built upon Data Juicer. We sincerely thank the Data Juicer team for their impactful work in data engineering.

📜 License

This project inherits the Apache License 2.0 from Data Juicer.

🚀 Quick Start

Building data-flow from Source

    docker build -t dataflow . -f Dockerfile
    docker buildx build --provenance false --platform linux/amd64 -t dataflow . -f Dockerfile
    docker buildx build --provenance false --platform linux/arm64 -t dataflow . -f Dockerfile

Prerequisites

Launch postgres container

    docker run -d --name dataflow-pg \
      -p 5433:5432 \
      -v /tmp/data_flow/pgdata:/var/lib/postgresql/data \
      -e POSTGRES_DB=data_flow \
      -e POSTGRES_USER=postgres \
      -e POSTGRES_PASSWORD=postgres \
      opencsg-registry.cn-beijing.cr.aliyuncs.com/opencsghq/csghub/postgres:15.10

Launch MongoDB container

    docker run -d --name dataflow-mongo \
      -p 27017:27017 \
      -v /tmp/data_flow/mongodata:/data/db \
      -e MONGO_INITDB_ROOT_USERNAME=root \
      -e MONGO_INITDB_ROOT_PASSWORD=example \
      opencsg-registry.cn-beijing.cr.aliyuncs.com/opencsghq/mongo:8.0.12

Launch redis container

    docker run -d --name dataflow-redis \
      -p 16379:6379 \
      -v /tmp/data_flow/redisdata:/data \
      opencsg-registry.cn-beijing.cr.aliyuncs.com/opencsghq/redis:7.2.5
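
Before starting the data-flow containers, it can help to confirm that the three services above accept connections on their published host ports (5433, 27017, and 16379, taken from the commands above). The following is a minimal sketch using only the Python standard library. The helper name `port_is_open` and the `SERVICES` mapping are illustrative and not part of the project, and the check assumes the containers run on the local machine.

```python
import socket

# Host ports published by the docker run commands above.
SERVICES = {"postgres": 5433, "mongo": 27017, "redis": 16379}


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    for name, port in SERVICES.items():
        state = "reachable" if port_is_open("127.0.0.1", port) else "NOT reachable"
        print(f"{name:<9} 127.0.0.1:{port} {state}")
```

A successful TCP connection only shows that the port is listening; it does not verify credentials or that the database has finished initializing.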

Installing data-flow

    docker run -d --name dataflow-api -p 8000:8000 \
      -v /tmp/data_flow/apidata:/data/dataflow_data \
      -c"uvicorn data_server.main:app --host 0.0.0.0 --port 8000" \
      -e DATA_DIR=/data/dataflow_data \
      -e CSGHUB_ENDPOINT=https://hub.opencsg.com \
      -e MAX_WORKERS=99 \
      -e RAY_ADDRESS=auto \
      -e RAY_ENABLE=False \
      -e RAY_LOG_DIR=/data/ray_output \
      -e API_SERVER=0.0.0.0 \
      -e API_PORT=8000 \
      -e ENABLE_OPENTELEMETRY=False \
      -e DATABASE_DB=data_flow \
      -e DATABASE_USERNAME=postgres \
      -e DATABASE_PASSWORD=postgres \
      -e DATABASE_HOSTNAME=127.0.0.1 \
      -e DATABASE_PORT=5433 \
      -e STUDIO_JUMP_URL=https://data-label.opencsg.com \
      -e REDIS_HOST_URL=redis://127.0.0.1:16379 \
      -e MONG_HOST_URL=mongodb://root:example@127.0.0.1:27017 \
      dataflow

Installing data-flow-celery

    docker run -d --name celery-work -p 8001:8001 \
      -v /tmp/data_flow/celery-data:/data/dataflow_celery \
      -c"celery -A data_celery.main:celery_app worker --loglevel=info --pool=gevent" \
      -e DATA_DIR=/data/dataflow_celery \
      -e CSGHUB_ENDPOINT=https://hub.opencsg.com \
      -e MAX_WORKERS=99 \
      -e RAY_ADDRESS=auto \
      -e RAY_ENABLE=False \
      -e RAY_LOG_DIR=/data/ray_output \
      -e API_SERVER=0.0.0.0 \
      -e API_PORT=8001 \
      -e ENABLE_OPENTELEMETRY=False \
      -e DATABASE_DB=data_flow \
      -e DATABASE_USERNAME=postgres \
      -e DATABASE_PASSWORD=postgres \
      -e DATABASE_HOSTNAME=127.0.0.1 \
      -e DATABASE_PORT=5433 \
      -e REDIS_HOST_URL=redis://127.0.0.1:16379 \
      -e MONG_HOST_URL=mongodb://root:example@127.0.0.1:27017 \
      dataflow-celery
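
Both commands above pass `REDIS_HOST_URL` and `MONG_HOST_URL`, and the Notes section requires these values to be consistent between data-flow and data-flow-celery. A small sketch for comparing the two sets of settings before launch follows. `SHARED_KEYS`, `find_mismatches`, and the two example dictionaries are hypothetical helpers written for illustration; they are not project code.

```python
# Settings that both services must agree on, per the Notes section.
SHARED_KEYS = ("REDIS_HOST_URL", "MONG_HOST_URL")


def find_mismatches(api_env, celery_env, keys=SHARED_KEYS):
    """Return {key: (api_value, celery_value)} for every key whose values differ."""
    mismatches = {}
    for key in keys:
        api_value = api_env.get(key)
        celery_value = celery_env.get(key)
        if api_value != celery_value:
            mismatches[key] = (api_value, celery_value)
    return mismatches


# Values copied from the two docker run commands above.
api_env = {
    "REDIS_HOST_URL": "redis://127.0.0.1:16379",
    "MONG_HOST_URL": "mongodb://root:example@127.0.0.1:27017",
}
celery_env = {
    "REDIS_HOST_URL": "redis://127.0.0.1:16379",
    "MONG_HOST_URL": "mongodb://root:example@127.0.0.1:27017",
}

print(find_mismatches(api_env, celery_env))  # {}
```

An empty dictionary means the shared settings match; any entry names a key to fix before starting the containers.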

Run data-flow server in development mode locally

Create a Virtual Environment

    uv venv --python 3.10
    source .venv/bin/activate
    # or
    conda create -n dataflow python=3.10

    # Install dependencies
    #pip install '.[dist]' -i https://pypi.tuna.tsinghua.edu.cn/simple/
    #pip install '.[tools]' -i https://pypi.tuna.tsinghua.edu.cn/simple/
    #pip install '.[sci]' -i https://pypi.tuna.tsinghua.edu.cn/simple/
    #pip install -r docker/requirements.txt
    uv pip install -r docker/dataflow_requirements.txt -i https://mirrors.aliyun.com/pypi/simple/
    # Run the server locally
    uvicorn data_server.main:app --reload

Run data-flow-celery server in development mode locally

    # Run the celery server locally
    celery -A data_celery.main:celery_app worker --loglevel=info --pool=gevent

Notes:

kenlm, simhash-pybind, opencc==1.1.8, and imagededup in the file environments/science_requires.txt are supported only on x86 platforms. Remove them if you are using an ARM platform.
The REDIS_HOST_URL and MONG_HOST_URL settings in data-flow and data-flow-celery must be consistent.
If you want to use the data annotation service, install and enable the Label Studio service. You also need to set the STUDIO_JUMP_URL variable of the data-flow service to the address of the Label Studio service.
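
The first note can be applied with a short script rather than by hand. This is a sketch, assuming environments/science_requires.txt lists one requirement per line in the usual pip format. The names `X86_ONLY`, `requirement_name`, and `strip_x86_only` are illustrative, and the sample list stands in for the real file contents.

```python
import re

# Packages the note above lists as x86-only.
X86_ONLY = {"kenlm", "simhash-pybind", "opencc", "imagededup"}


def requirement_name(line: str) -> str:
    """Return the normalized package name at the start of a requirements line."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", line)
    if not match:
        return ""  # comments, blank lines, and option lines have no package name
    # Treat runs of '-', '_' and '.' as equivalent and ignore case.
    return re.sub(r"[-_.]+", "-", match.group(1)).lower()


def strip_x86_only(lines):
    """Drop lines whose package is in X86_ONLY; keep everything else unchanged."""
    return [line for line in lines if requirement_name(line) not in X86_ONLY]


sample = [
    "numpy",
    "kenlm",
    "opencc==1.1.8",
    "simhash-pybind",
    "# comment",
    "imagededup>=0.3",
    "pandas",
]
print(strip_x86_only(sample))  # ['numpy', '# comment', 'pandas']
```

To use it on the real file, read the lines from environments/science_requires.txt, pass them through `strip_x86_only`, and write the result to a separate file so the original stays intact.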

🛣️ Roadmap

Upcoming:

Enhanced real-time data streaming
AutoML integration for automated model tuning
Cross-cloud synchronization
Support for more data sources

🤝 Contributing

We welcome contributions!

📞 Contact

For support or queries:

Email: community@opencsg.com
GitHub: OpenCSG/DataFlow

