OpenCSG dataflow is a one-stop data processing platform designed to leverage large model technology and advanced algorithms to optimize the entire data processing lifecycle, enhancing efficiency and precision, while addressing enterprise challenges in data management such as inefficiency, adaptability gaps, and security and compliance issues.
OpenCSGs/csghub-dataflow
DataFlow is an open-source platform engineered to streamline end-to-end data processing within the AI/ML lifecycle. By unifying data workflows and model optimization, it transforms fragmented pipelines into a cohesive, automated system—ideal for enterprises tackling data complexity at scale.
🔑 Key Features
- Full Lifecycle Management
  - Unified handling of data ingestion, transformation, modeling, and evaluation.
- Seamless CSGHub Integration
  - Directly ingest datasets from CSGHub and push refined data back for model retraining, creating a continuous feedback loop.
- Modular & Extensible Design
  - Plug-and-play operators for custom pipelines (e.g., NLP, image, audio processing).
- Distributed Computing
  - Scale workloads across clusters via Kubernetes integration.
- Multi-Agent Task Orchestration
  - Dynamically allocate complex tasks (e.g., data validation, anomaly detection) to collaborative agents.
- MinerU Engine
  - Convert PDFs to structured Markdown/JSON for LLM-friendly datasets.
- Growing Operator Library
  - Expandable support for multimodal data (text, image, video) and domain-specific transformations.
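
The plug-and-play operator idea can be illustrated with a minimal sketch. The names below (`OPERATORS`, `register`, `run_pipeline`, and the two example operators) are hypothetical and are not the actual DataFlow or Data Juicer API. They only show how independent operators might be registered by name and chained into a pipeline.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical registry mapping an operator name to a function over one record.
# Illustration only: this is NOT the DataFlow or Data Juicer API.
Record = Dict[str, str]
Operator = Callable[[Record], Record]
OPERATORS: Dict[str, Operator] = {}


def register(name: str) -> Callable[[Operator], Operator]:
    """Decorator that adds an operator to the registry under `name`."""
    def decorator(func: Operator) -> Operator:
        OPERATORS[name] = func
        return func
    return decorator


@register("strip_whitespace")
def strip_whitespace(record: Record) -> Record:
    # Return a copy of the record with surrounding whitespace removed.
    return {**record, "text": record["text"].strip()}


@register("lowercase")
def lowercase(record: Record) -> Record:
    # Return a copy of the record with the text lowercased.
    return {**record, "text": record["text"].lower()}


def run_pipeline(steps: List[str], records: Iterable[Record]) -> List[Record]:
    """Apply the named operators to every record, in the order given."""
    output = []
    for record in records:
        for step in steps:
            record = OPERATORS[step](record)
        output.append(record)
    return output


if __name__ == "__main__":
    print(run_pipeline(["strip_whitespace", "lowercase"],
                       [{"text": "  Hello DataFlow  "}]))
```

In this sketch, adding a new operator only requires one decorated function, and a pipeline is just an ordered list of operator names.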
This project is built upon Data Juicer. We sincerely thank the Data Juicer team for their impactful work in data engineering.
This project inherits the Apache License 2.0 from Data Juicer.
```bash
# Build for the host platform
docker build -t dataflow . -f Dockerfile

# Or build explicitly for linux/amd64
docker buildx build --provenance false --platform linux/amd64 -t dataflow . -f Dockerfile

# Or build explicitly for linux/arm64
docker buildx build --provenance false --platform linux/arm64 -t dataflow . -f Dockerfile
```

Launch the PostgreSQL container:

```bash
docker run -d --name dataflow-pg \
  -p 5433:5432 \
  -v /tmp/data_flow/pgdata:/var/lib/postgresql/data \
  -e POSTGRES_DB=data_flow \
  -e POSTGRES_USER=postgres \
  -e POSTGRES_PASSWORD=postgres \
  opencsg-registry.cn-beijing.cr.aliyuncs.com/opencsghq/csghub/postgres:15.10
```

Launch the MongoDB container:

```bash
docker run -d --name dataflow-mongo \
  -p 27017:27017 \
  -v /tmp/data_flow/mongodata:/data/db \
  -e MONGO_INITDB_ROOT_USERNAME=root \
  -e MONGO_INITDB_ROOT_PASSWORD=example \
  opencsg-registry.cn-beijing.cr.aliyuncs.com/opencsghq/mongo:8.0.12
```

Launch the Redis container:

```bash
docker run -d --name dataflow-redis \
  -p 16379:6379 \
  -v /tmp/data_flow/redisdata:/data \
  opencsg-registry.cn-beijing.cr.aliyuncs.com/opencsghq/redis:7.2.5
```
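
Before starting the API and Celery containers, it can help to confirm that the three datastores are accepting connections. The following is a minimal sketch that uses only the Python standard library. The host ports come from the `docker run -p` mappings above, while the function names (`is_port_open`, `check_services`) are hypothetical helpers and not part of DataFlow. The check only verifies TCP reachability, not credentials.

```python
import socket
from typing import Dict, Tuple

# Host-side ports from the `-p` mappings of the three containers.
SERVICES: Dict[str, Tuple[str, int]] = {
    "postgres": ("127.0.0.1", 5433),
    "mongo": ("127.0.0.1", 27017),
    "redis": ("127.0.0.1", 16379),
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_services(services: Dict[str, Tuple[str, int]]) -> Dict[str, bool]:
    """Map each service name to whether its port is currently reachable."""
    return {name: is_port_open(host, port) for name, (host, port) in services.items()}


if __name__ == "__main__":
    for name, ok in check_services(SERVICES).items():
        print(f"{name}: {'reachable' if ok else 'NOT reachable'}")
```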
Launch the API server container:

```bash
docker run -d --name dataflow-api -p 8000:8000 \
  -v /tmp/data_flow/apidata:/data/dataflow_data \
  -c"uvicorn data_server.main:app --host 0.0.0.0 --port 8000" \
  -e DATA_DIR=/data/dataflow_data \
  -e CSGHUB_ENDPOINT=https://hub.opencsg.com \
  -e MAX_WORKERS=99 \
  -e RAY_ADDRESS=auto \
  -e RAY_ENABLE=False \
  -e RAY_LOG_DIR=/data/ray_output \
  -e API_SERVER=0.0.0.0 \
  -e API_PORT=8000 \
  -e ENABLE_OPENTELEMETRY=False \
  -e DATABASE_DB=data_flow \
  -e DATABASE_USERNAME=postgres \
  -e DATABASE_PASSWORD=postgres \
  -e DATABASE_HOSTNAME=127.0.0.1 \
  -e DATABASE_PORT=5433 \
  -e STUDIO_JUMP_URL=https://data-label.opencsg.com \
  -e REDIS_HOST_URL=redis://127.0.0.1:16379 \
  -e MONG_HOST_URL=mongodb://root:example@127.0.0.1:27017 \
  dataflow
```

Launch the Celery worker container:

```bash
docker run -d --name celery-work -p 8001:8001 \
  -v /tmp/data_flow/celery-data:/data/dataflow_celery \
  -c"celery -A data_celery.main:celery_app worker --loglevel=info --pool=gevent" \
  -e DATA_DIR=/data/dataflow_celery \
  -e CSGHUB_ENDPOINT=https://hub.opencsg.com \
  -e MAX_WORKERS=99 \
  -e RAY_ADDRESS=auto \
  -e RAY_ENABLE=False \
  -e RAY_LOG_DIR=/data/ray_output \
  -e API_SERVER=0.0.0.0 \
  -e API_PORT=8001 \
  -e ENABLE_OPENTELEMETRY=False \
  -e DATABASE_DB=data_flow \
  -e DATABASE_USERNAME=postgres \
  -e DATABASE_PASSWORD=postgres \
  -e DATABASE_HOSTNAME=127.0.0.1 \
  -e DATABASE_PORT=5433 \
  -e REDIS_HOST_URL=redis://127.0.0.1:16379 \
  -e MONG_HOST_URL=mongodb://root:example@127.0.0.1:27017 \
  dataflow-celery
```

For local development, create a Python 3.10 environment:

```bash
uv venv --python 3.10
source .venv/bin/activate
# or
conda create -n dataflow python=3.10
```

```bash
# Install dependencies
#pip install '.[dist]' -i https://pypi.tuna.tsinghua.edu.cn/simple/
#pip install '.[tools]' -i https://pypi.tuna.tsinghua.edu.cn/simple/
#pip install '.[sci]' -i https://pypi.tuna.tsinghua.edu.cn/simple/
#pip install -r docker/requirements.txt
uv pip install -r docker/dataflow_requirements.txt -i https://mirrors.aliyun.com/pypi/simple/

# Run the server locally
uvicorn data_server.main:app --reload
```

```bash
# Run the celery server locally
celery -A data_celery.main:celery_app worker --loglevel=info --pool=gevent
```

Notes:
- `kenlm`, `simhash-pybind`, `opencc==1.1.8`, and `imagededup` in the file `environments/science_requires.txt` only support the x86 platform. Remove them if you are using an ARM platform.
- The values of `REDIS_HOST_URL` and `MONG_HOST_URL` in `data-flow` and `data-flow-celery` must be consistent.
- If you want to use the data annotation service, install and enable the Label Studio service. You also need to set the `STUDIO_JUMP_URL` variable of the `data-flow` service to the address of the Label Studio service.
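
The consistency requirement for `REDIS_HOST_URL` and `MONG_HOST_URL` can be checked mechanically before deployment. The sketch below assumes the environment of each service is available as a plain dictionary. The helper name `find_mismatches` is hypothetical and not part of DataFlow, and the example values are the ones used in the `docker run` commands above.

```python
from typing import Dict, List

# Variables that, per the note above, must match between the two services.
SHARED_KEYS = ("REDIS_HOST_URL", "MONG_HOST_URL")


def find_mismatches(api_env: Dict[str, str], celery_env: Dict[str, str]) -> List[str]:
    """Return the shared keys that are missing or differ between the two services."""
    return [
        key
        for key in SHARED_KEYS
        if api_env.get(key) is None or api_env.get(key) != celery_env.get(key)
    ]


if __name__ == "__main__":
    api_env = {
        "REDIS_HOST_URL": "redis://127.0.0.1:16379",
        "MONG_HOST_URL": "mongodb://root:example@127.0.0.1:27017",
    }
    celery_env = dict(api_env)  # identical settings for the Celery worker
    mismatches = find_mismatches(api_env, celery_env)
    print("consistent" if not mismatches else f"mismatched keys: {mismatches}")
```

An empty result means both services point at the same Redis and MongoDB instances. Any returned key should be corrected before starting the containers.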
Upcoming:
- Enhanced real-time data streaming
- AutoML integration for automated model tuning
- Cross-cloud synchronization
- Support for more data sources
We welcome contributions!
For support or queries:
- Email: community@opencsg.com
- GitHub: OpenCSG/DataFlow