GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

License

Apache-2.0 license

📝 What is GraphGen?

GraphGen is a framework for synthetic data generation guided by knowledge graphs. Please check the paper and best practice.

Here are post-training results in which over 50% of the SFT data comes from GraphGen and our data-cleaning pipeline.

| Domain | Dataset | Ours | Qwen2.5-7B-Instruct (baseline) |
|---|---|---|---|
| Plant | SeedBench | 65.9 | 51.5 |
| Common | CMMLU | 73.6 | 75.8 |
| Knowledge | GPQA-Diamond | 40.0 | 33.3 |
| Math | AIME24 | 20.6 | 16.7 |
| Math | AIME25 | 22.7 | 7.2 |

GraphGen begins by constructing a fine-grained knowledge graph from the source text, then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data.
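To make the expected calibration error (ECE) mentioned above concrete, here is a minimal sketch of the standard equal-width binned ECE. It is an illustration only, not GraphGen's own implementation: the function name `expected_calibration_error`, the choice of 10 bins, and the input format (one confidence and one correctness flag per statement) are all assumptions made for this example.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")

    # Group predictions into equal-width confidence bins over [0, 1].
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's calibration gap by its share of all predictions.
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Hypothetical data: the model's confidence in four statements and whether
# each statement was actually judged correctly.
sample_confidences = [0.95, 0.85, 0.65, 0.55]
sample_correct = [True, False, True, False]
print(expected_calibration_error(sample_confidences, sample_correct))
```

A larger value means the model's stated confidence diverges more from its actual accuracy, which is the kind of signal that can be used to prioritize knowledge the trainee model has not yet mastered.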

After data generation, you can use LLaMA-Factory and xtuner to fine-tune your LLMs.

📌 Latest Updates

2025.09.29: The Gradio demo is now automatically updated on Hugging Face and ModelScope.
2025.08.14: We have added support for community detection in knowledge graphs using the Leiden algorithm, enabling the synthesis of Chain-of-Thought (CoT) data.
2025.07.31: We have added Google, Bing, Wikipedia, and UniProt as search back-ends.
2025.04.21: We have released the initial version of GraphGen.

🚀 Quick Start

Experience GraphGen through the Web or the Backup Web Entrance.

For any questions, please check the FAQ, open a new issue, or join our WeChat group and ask.

Preparation

Install uv
```
# If you run into network issues, you can install uv with pipx or pip instead; refer to the uv docs for more details
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Clone the repository

```
git clone --depth=1 https://github.com/open-sciencelab/GraphGen
cd GraphGen
```

Create a new uv environment
```
 uv venv --python 3.10
```
Configure the dependencies
```
uv pip install -r requirements.txt
```

Run Gradio Demo

```
python -m webui.app
```

Run from PyPI

Install GraphGen
```
uv pip install graphg
```

Run in CLI

```
SYNTHESIZER_MODEL=your_synthesizer_model_name \
SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model \
SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model \
TRAINEE_MODEL=your_trainee_model_name \
TRAINEE_BASE_URL=your_base_url_for_trainee_model \
TRAINEE_API_KEY=your_api_key_for_trainee_model \
graphg --output_dir cache
```

Run from Source

Configure the environment

Create a .env file in the root directory
```
cp .env.example .env
```

Set the following environment variables:

```
# Synthesizer is the model used to construct KG and generate data
SYNTHESIZER_MODEL=your_synthesizer_model_name
SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model
SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model
# Trainee is the model used to train with the generated data
TRAINEE_MODEL=your_trainee_model_name
TRAINEE_BASE_URL=your_base_url_for_trainee_model
TRAINEE_API_KEY=your_api_key_for_trainee_model
```
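Before generating data, it can help to confirm that all six variables above are actually filled in. The following is a small standalone sketch, not part of GraphGen: the helper names `parse_env_text` and `missing_keys` are hypothetical, the parser handles only simple `KEY=value` lines, and treating values that still start with `your_` as unset is an assumption based on the placeholders shown above.

```python
from pathlib import Path

# The six variables listed in this README.
REQUIRED_KEYS = (
    "SYNTHESIZER_MODEL",
    "SYNTHESIZER_BASE_URL",
    "SYNTHESIZER_API_KEY",
    "TRAINEE_MODEL",
    "TRAINEE_BASE_URL",
    "TRAINEE_API_KEY",
)


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so URLs containing "=" stay intact.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return required keys that are absent, empty, or still a placeholder."""
    missing = []
    for key in REQUIRED_KEYS:
        value = values.get(key, "")
        # Assumption: template placeholders start with "your_", as shown above.
        if not value or value.startswith("your_"):
            missing.append(key)
    return missing


if __name__ == "__main__":
    env_path = Path(".env")
    text = env_path.read_text(encoding="utf-8") if env_path.exists() else ""
    missing = missing_keys(parse_env_text(text))
    print("Missing variables:", ", ".join(missing) if missing else "none")
```

Run it from the repository root after copying `.env.example` to `.env`; any variable it reports still needs a real value.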

(Optional) Customize generation parameters in the graphgen/configs/ folder.

Edit the corresponding YAML file, e.g.:

```
# configs/cot_config.yaml
input_file: resources/input_examples/jsonl_demo.jsonl
output_data_type: cot
tokenizer: cl100k_base
# additional settings...
```

Generate data

Pick the desired format and run the matching script:

| Format | Script to run | Notes |
|---|---|---|
| `cot` | `bash scripts/generate/generate_cot.sh` | Chain-of-Thought Q&A pairs |
| `atomic` | `bash scripts/generate/generate_atomic.sh` | Atomic Q&A pairs covering basic knowledge |
| `aggregated` | `bash scripts/generate/generate_aggregated.sh` | Aggregated Q&A pairs incorporating complex, integrated knowledge |
| `multi-hop` | `bash scripts/generate/generate_multihop.sh` | Multi-hop reasoning Q&A pairs |

Get the generated data
```
ls cache/data/graphgen
```

Run with Docker

Build the Docker image
```
docker build -t graphgen .
```
Run the Docker container
```
 docker run -p 7860:7860 graphgen
```

🏗️ System Architecture

See the analysis by DeepWiki for a technical overview of the GraphGen system, its architecture, and core functionalities.

Workflow

🍀 Acknowledgements

SiliconFlow: Abundant LLM APIs, some models are free
LightRAG: Simple and efficient graph retrieval solution
ROGRAG: A robustly optimized GraphRAG framework
DB-GPT: An AI native data app development framework

📚 Citation

If you find this repository useful, please consider citing our work:

```
@misc{chen2025graphgenenhancingsupervisedfinetuning,
  title={GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation},
  author={Zihong Chen and Wanli Jiang and Jinzhe Li and Zhonghang Yuan and Huanjun Kong and Wanli Ouyang and Nanqing Dong},
  year={2025},
  eprint={2505.20416},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.20416},
}
```