GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation



📝 What is GraphGen?

GraphGen is a framework for synthetic data generation guided by knowledge graphs. Please check the paper and best practice.

Here are post-training results for a model in which over 50% of the SFT data comes from GraphGen and our data-cleaning pipeline.

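Domain    | Dataset      | Ours | Qwen2.5-7B-Instruct (baseline)
----------|--------------|------|-------------------------------
Plant     | SeedBench    | 65.9 | 51.5
Common    | CMMLU        | 73.6 | 75.8
Knowledge | GPQA-Diamond | 40.0 | 33.3
Math      | AIME24       | 20.6 | 16.7
Math      | AIME25       | 22.7 | 7.2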

GraphGen begins by constructing a fine-grained knowledge graph from the source text, then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data.
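The two ideas named above, expected calibration error and multi-hop neighborhood sampling, can be illustrated with a minimal sketch. This is not GraphGen's implementation: the function names, the equal-width binning scheme, and the plain dictionary used as an adjacency list are all assumptions chosen for illustration.

```python
from collections import deque


def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: weighted mean of |accuracy - confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp the index so 1.0 lands in the last bin and negatives in the first.
        idx = max(0, min(int(conf * n_bins), n_bins - 1))
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


def k_hop_neighborhood(adjacency, start, k):
    """Breadth-first search: map each node within k hops of start to its distance."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == k:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    return seen
```

A high ECE on statements drawn from one region of the graph would suggest the model is poorly calibrated there, which is the kind of signal that could be used to prioritize QA generation; the k-hop helper shows how a subgraph around an entity might be collected for a multi-hop question.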

After data generation, you can use LLaMA-Factory and xtuner to fine-tune your LLMs.
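As a rough sketch of the hand-off to a fine-tuning tool, the snippet below maps QA pairs to Alpaca-style records (`instruction`, `input`, `output`), a layout that LLaMA-Factory can load. The input keys `question` and `answer` are hypothetical; check the actual field names in your generated files before using anything like this.

```python
import json


def to_alpaca_records(qa_pairs):
    """Map question/answer dicts to Alpaca-style instruction records."""
    records = []
    for pair in qa_pairs:
        records.append(
            {
                "instruction": pair["question"],  # hypothetical input key
                "input": "",
                "output": pair["answer"],  # hypothetical input key
            }
        )
    return records


def write_alpaca_file(qa_pairs, path):
    """Write the converted records as a single JSON array."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(to_alpaca_records(qa_pairs), handle, ensure_ascii=False, indent=2)
```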

📌 Latest Updates

  • 2025.09.29: The Gradio demo is now updated automatically on Hugging Face and ModelScope.
  • 2025.08.14: We have added support for community detection in knowledge graphs using the Leiden algorithm, enabling the synthesis of Chain-of-Thought (CoT) data.
  • 2025.07.31: We have added Google, Bing, Wikipedia, and UniProt as search back-ends.
  • 2025.04.21: We have released the initial version of GraphGen.

🚀 Quick Start

Try GraphGen through the Web demo or the Backup Web Entrance.

For any questions, please check the FAQ, open a new issue, or join our WeChat group and ask.

Preparation

  1. Install uv

    # If you run into network issues, try installing uv with pipx or pip; see the uv documentation for details.
    curl -LsSf https://astral.sh/uv/install.sh | sh
  2. Clone the repository

    git clone --depth=1 https://github.com/open-sciencelab/GraphGen
    cd GraphGen
  3. Create a new uv environment

     uv venv --python 3.10
  4. Configure the dependencies

    uv pip install -r requirements.txt

Run Gradio Demo

python -m webui.app

[Image: GraphGen web UI]

Run from PyPI

  1. Install GraphGen

    uv pip install graphg
  2. Run in CLI

    SYNTHESIZER_MODEL=your_synthesizer_model_name \
    SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model \
    SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model \
    TRAINEE_MODEL=your_trainee_model_name \
    TRAINEE_BASE_URL=your_base_url_for_trainee_model \
    TRAINEE_API_KEY=your_api_key_for_trainee_model \
    graphg --output_dir cache
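Because the CLI is configured entirely through environment variables, a small pre-flight check can catch a missing value before a run starts. The sketch below is a standalone helper, not part of GraphGen; the function name `missing_variables` is an assumption, while the variable names are the six listed above.

```python
import os

# The six variables named in the command above.
REQUIRED_VARS = (
    "SYNTHESIZER_MODEL",
    "SYNTHESIZER_BASE_URL",
    "SYNTHESIZER_API_KEY",
    "TRAINEE_MODEL",
    "TRAINEE_BASE_URL",
    "TRAINEE_API_KEY",
)


def missing_variables(environ=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("All required variables are set.")
```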

Run from Source

  1. Configure the environment

    • Create a .env file in the root directory
      cp .env.example .env
    • Set the following environment variables:
      # Synthesizer is the model used to construct KG and generate data
      SYNTHESIZER_MODEL=your_synthesizer_model_name
      SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model
      SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model
      # Trainee is the model used to train with the generated data
      TRAINEE_MODEL=your_trainee_model_name
      TRAINEE_BASE_URL=your_base_url_for_trainee_model
      TRAINEE_API_KEY=your_api_key_for_trainee_model
  2. (Optional) Customize generation parameters in the graphgen/configs/ folder.

    Edit the corresponding YAML file, e.g.:

    # configs/cot_config.yaml
    input_file: resources/input_examples/jsonl_demo.jsonl
    output_data_type: cot
    tokenizer: cl100k_base
    # additional settings...
  3. Generate data

    Pick the desired format and run the matching script:

    Format     | Script to run                                | Notes
    -----------|----------------------------------------------|------
    cot        | bash scripts/generate/generate_cot.sh        | Chain-of-Thought Q&A pairs
    atomic     | bash scripts/generate/generate_atomic.sh     | Atomic Q&A pairs covering basic knowledge
    aggregated | bash scripts/generate/generate_aggregated.sh | Aggregated Q&A pairs incorporating complex, integrated knowledge
    multi-hop  | bash scripts/generate/generate_multihop.sh   | Multi-hop reasoning Q&A pairs
  4. Get the generated data

    ls cache/data/graphgen
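Beyond `ls`, a quick way to see what a run produced is to count the output files by extension. The sketch below makes no assumption about the file format GraphGen writes; the helper name `summarize_output` is hypothetical, and the path is simply the one used in the step above.

```python
from collections import Counter
from pathlib import Path


def summarize_output(directory):
    """Count files under a directory (recursively), grouped by extension."""
    root = Path(directory)
    if not root.is_dir():
        return {}
    counts = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            counts[path.suffix or "(no extension)"] += 1
    return dict(counts)


if __name__ == "__main__":
    for suffix, count in sorted(summarize_output("cache/data/graphgen").items()):
        print(f"{suffix}: {count} file(s)")
```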

Run with Docker

  1. Build the Docker image
    docker build -t graphgen .
  2. Run the Docker container
     docker run -p 7860:7860 graphgen

🏗️ System Architecture

See the analysis by deepwiki for a technical overview of the GraphGen system, its architecture, and core functionalities.

Workflow

[Figure: GraphGen workflow]

🍀 Acknowledgements

  • SiliconFlow: abundant LLM APIs, some models are free
  • LightRAG: a simple and efficient graph retrieval solution
  • ROGRAG: a robustly optimized GraphRAG framework
  • DB-GPT: an AI-native data app development framework

📚 Citation

If you find this repository useful, please consider citing our work:

@misc{chen2025graphgenenhancingsupervisedfinetuning,
  title={GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation},
  author={Zihong Chen and Wanli Jiang and Jinzhe Li and Zhonghang Yuan and Huanjun Kong and Wanli Ouyang and Nanqing Dong},
  year={2025},
  eprint={2505.20416},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.20416},
}

📜 License

This project is licensed under the Apache License 2.0.
