GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
GraphGen is a framework for synthetic data generation guided by knowledge graphs. Please check the paper and best practice.
Below are post-training results for a model in which over 50% of the SFT data comes from GraphGen and our data-cleaning pipeline.
Domain | Dataset | Ours | Qwen2.5-7B-Instruct (baseline) |
---|---|---|---|
Plant | SeedBench | 65.9 | 51.5 |
Common | CMMLU | 73.6 | 75.8 |
Knowledge | GPQA-Diamond | 40.0 | 33.3 |
Math | AIME24 | 20.6 | 16.7 |
Math | AIME25 | 22.7 | 7.2 |
It begins by constructing a fine-grained knowledge graph from the source text, then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data.
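For readers unfamiliar with the expected calibration error (ECE) mentioned above, here is a minimal sketch of the standard binned definition. It is a generic illustration, not GraphGen's own implementation: the function name, the equal-width binning, and the inputs (a confidence in [0, 1] and a correctness flag per answer) are assumptions made for this example.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin_size / N) * |accuracy - mean confidence|.

    Hypothetical helper for illustration; GraphGen's actual computation may differ.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Assign each prediction to an equal-width confidence bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece
```

A model that is confident but often wrong on some piece of knowledge yields a large gap between accuracy and mean confidence, which is the kind of signal a calibration-based metric can use to flag a knowledge gap.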
After data generation, you can use LLaMA-Factory and xtuner to finetune your LLMs.
- 2025.09.29: We auto-update the gradio demo on Hugging Face and ModelScope.
- 2025.08.14: We have added support for community detection in knowledge graphs using the Leiden algorithm, enabling the synthesis of Chain-of-Thought (CoT) data.
- 2025.07.31: We have added Google, Bing, Wikipedia, and UniProt as search back-ends.
- 2025.04.21: We have released the initial version of GraphGen.
Experience GraphGen through Web or Backup Web Entrance
For any questions, please check the FAQ, open a new issue, or join our wechat group and ask.
Install uv
# If you run into network issues, you can install uv with pipx or pip instead; see the uv docs for details
curl -LsSf https://astral.sh/uv/install.sh | sh
Clone the repository
git clone --depth=1 https://github.com/open-sciencelab/GraphGen
cd GraphGen
Create a new uv environment
uv venv --python 3.10
Configure the dependencies
uv pip install -r requirements.txt
python -m webui.app
Install GraphGen
uv pip install graphg
Run in CLI
SYNTHESIZER_MODEL=your_synthesizer_model_name \
SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model \
SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model \
TRAINEE_MODEL=your_trainee_model_name \
TRAINEE_BASE_URL=your_base_url_for_trainee_model \
TRAINEE_API_KEY=your_api_key_for_trainee_model \
graphg --output_dir cache
Configure the environment
- Create an .env file in the root directory
cp .env.example .env
- Set the following environment variables:
# Synthesizer is the model used to construct KG and generate data
SYNTHESIZER_MODEL=your_synthesizer_model_name
SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model
SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model
# Trainee is the model used to train with the generated data
TRAINEE_MODEL=your_trainee_model_name
TRAINEE_BASE_URL=your_base_url_for_trainee_model
TRAINEE_API_KEY=your_api_key_for_trainee_model
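Before launching a run, it can help to confirm that all six variables are actually set. The sketch below is a hypothetical pre-flight check, not part of GraphGen: the helper name `missing_env_vars` is an assumption, and it only inspects the process environment (or a mapping you pass in), so it does not parse the .env file itself.

```python
import os

# Variable names taken from the .env listing above.
REQUIRED_VARS = (
    "SYNTHESIZER_MODEL",
    "SYNTHESIZER_BASE_URL",
    "SYNTHESIZER_API_KEY",
    "TRAINEE_MODEL",
    "TRAINEE_BASE_URL",
    "TRAINEE_API_KEY",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or blank.

    Hypothetical helper: `env` defaults to os.environ, but any mapping works.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not str(env.get(name, "")).strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing variables: " + ", ".join(missing))
    else:
        print("All required variables are set.")
```

Accepting a mapping makes the check easy to exercise without touching the real environment.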
(Optional) Customize generation parameters in the graphgen/configs/ folder. Edit the corresponding YAML file, e.g.:
# configs/cot_config.yaml
input_file: resources/input_examples/jsonl_demo.jsonl
output_data_type: cot
tokenizer: cl100k_base
# additional settings...
Generate data
Pick the desired format and run the matching script:
Format | Script to run | Notes |
---|---|---|
cot | bash scripts/generate/generate_cot.sh | Chain-of-Thought Q&A pairs |
atomic | bash scripts/generate/generate_atomic.sh | Atomic Q&A pairs covering basic knowledge |
aggregated | bash scripts/generate/generate_aggregated.sh | Aggregated Q&A pairs incorporating complex, integrated knowledge |
multi-hop | bash scripts/generate/generate_multihop.sh | Multi-hop reasoning Q&A pairs |
Get the generated data
ls cache/data/graphgen
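If you prefer to drive generation from Python, the format-to-script mapping in this section can be wrapped in a small helper. This is a sketch under stated assumptions: `GENERATE_SCRIPTS`, `command_for`, and `list_outputs` are hypothetical names introduced here, the script paths are copied from this section, and the format of the generated files is not assumed.

```python
from pathlib import Path

# Format name -> script path, as listed in this README.
GENERATE_SCRIPTS = {
    "cot": "scripts/generate/generate_cot.sh",
    "atomic": "scripts/generate/generate_atomic.sh",
    "aggregated": "scripts/generate/generate_aggregated.sh",
    "multi-hop": "scripts/generate/generate_multihop.sh",
}


def command_for(fmt):
    """Return the argv list that runs the generation script for a format."""
    try:
        script = GENERATE_SCRIPTS[fmt]
    except KeyError:
        known = ", ".join(sorted(GENERATE_SCRIPTS))
        raise ValueError(f"unknown format {fmt!r}; expected one of: {known}") from None
    return ["bash", script]


def list_outputs(output_dir="cache/data/graphgen"):
    """Return the files in the output directory, sorted by path.

    Returns an empty list if the directory does not exist yet.
    """
    root = Path(output_dir)
    if not root.is_dir():
        return []
    return sorted(path for path in root.iterdir() if path.is_file())
```

From the repository root, `subprocess.run(command_for("cot"), check=True)` would then run the script, and `list_outputs()` mirrors the `ls` command above.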
- Build the Docker image
docker build -t graphgen .
- Run the Docker container
docker run -p 7860:7860 graphgen
See analysis by deepwiki for a technical overview of the GraphGen system, its architecture, and core functionalities.
- SiliconFlow: Abundant LLM API, some models are free
- LightRAG: Simple and efficient graph retrieval solution
- ROGRAG: A robustly optimized GraphRAG framework
- DB-GPT: An AI native data app development framework
If you find this repository useful, please consider citing our work:
@misc{chen2025graphgenenhancingsupervisedfinetuning,
  title={GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation},
  author={Zihong Chen and Wanli Jiang and Jinzhe Li and Zhonghang Yuan and Huanjun Kong and Wanli Ouyang and Nanqing Dong},
  year={2025},
  eprint={2505.20416},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.20416},
}
This project is licensed under the Apache License 2.0.