[ACL2024 Findings] Knowledge-to-SQL: Enhancing SQL Generation with Data Expert LLM
[2025/08] All the code has been released. This repository will no longer be updated. Please feel free to email or open an issue if you need any assistance.
[2024/10] Check our video presentation in Underline!
[2024/08] The video presentation of our paper will be available soon.
[2024/08] The presentation of our paper is scheduled at Virtual Poster Session 2; check the poster and slides here.
[2024/05] Our paper is accepted as a findings paper in ACL2024!
We propose a novel framework, Knowledge-to-SQL, that leverages a Data Expert Large Language Model (DELLM) to enhance SQL generation. The paper is available here.
The GPU resources used in our study are 4× A800-SXM4-80G with CUDA version 12.1; we strongly recommend using a torch version above 2.0.
```shell
# Clone the repository
git clone https://github.com/Rcrossmeister/Knowledge-to-SQL.git
cd ./Knowledge-to-SQL

# Create the conda environment
conda create -n dellm python=3.11.3
conda activate dellm

# Install the required packages
pip install -r requirements.txt
```
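You may want to confirm that the environment matches these requirements before training; a minimal sketch (not part of the repository, version thresholds follow the recommendation above):

```python
# Quick environment sanity check (illustrative, not part of the repository).
import torch

print("torch version :", torch.__version__)          # recommended: >= 2.0
print("CUDA available:", torch.cuda.is_available())  # the paper's setup uses CUDA 12.1
print("GPU count     :", torch.cuda.device_count())  # the paper uses 4x A800-SXM4-80G
```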
We mainly focus on the BIRD dataset in our study, and we also support the Spider dataset for the robustness study. You can also deploy DELLM on your own database by formatting your setup according to the standard benchmarks above.
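If you format your own database, each record should look like a BIRD-style entry. The sketch below is illustrative only; the field names (`question`, `evidence`, `db_id`, `SQL`) follow the BIRD release, and the example values and output path are hypothetical:

```python
# Illustrative BIRD-style record for a custom database (values are hypothetical).
import json

record = {
    "question": "How many orders were placed in 2023?",
    "evidence": "orders placed in 2023 refers to strftime('%Y', order_date) = '2023'",
    "db_id": "my_shop",  # must match a database folder name under your *_databases/ directory
    "SQL": "SELECT COUNT(*) FROM orders WHERE strftime('%Y', order_date) = '2023';",
}

with open("./dataset/custom/train.json", "w") as f:
    json.dump([record], f, indent=2)
```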
Dataset Preparation
The BIRD and Spider datasets used in the paper can be directly downloaded from the BIRD and Spider leaderboards. After downloading and unzipping, place the contents into `./dataset/bird/train/` and `./dataset/bird/dev/`. The Spider dataset will not be used for training; after downloading, place the corresponding contents into `./dataset/spider/`.
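Before running the pre-processing below, you can verify that the files ended up in the expected locations (a minimal sketch using the paths above):

```python
# Verify the expected dataset layout (illustrative).
from pathlib import Path

expected = [
    "./dataset/bird/train/train.json",
    "./dataset/bird/train/train_databases",
    "./dataset/bird/dev",
    "./dataset/spider",
]

for path in expected:
    status = "ok" if Path(path).exists() else "MISSING"
    print(f"{status:8s} {path}")
```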
Pre-processing
Once the datasets are prepared, run the pre-processing script to generate the training data. The pre-processing step is not compulsory; you can also modify it to produce a specific input format.
```shell
python dataset/preprocessor.py \
    --data_path ./dataset/bird/train/train.json \
    --db_root_path ./dataset/bird/train/train_databases/ \
    --output_path ./model/data/SFT-EKG.json
```
The data for supervised fine-tuning of DELLM will be saved in `./model/data` after pre-processing.
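To sanity-check the result, you can load the produced file and look at one record (a minimal sketch, assuming `SFT-EKG.json` is a JSON list as written by the command above):

```python
# Peek at the supervised fine-tuning data produced by the preprocessor (illustrative).
import json

with open("./model/data/SFT-EKG.json") as f:
    data = json.load(f)

print(f"{len(data)} training examples")
print(json.dumps(data[0], indent=2)[:800])  # first record, truncated for display
```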
We use LLaMA-2 as the backbone model in our paper, and we also support several popular open-source LLMs such as ChatGLM and Qwen. To load the model weights locally, using LLaMA-2-13b as an example:
```shell
mkdir backbone_model && cd backbone_model
git lfs install
git clone https://huggingface.co/meta-llama/Llama-2-13b-hf
```
Alternatively, you can replace the local path in the `--model_name_or_path` argument with the Hugging Face repository name (e.g., `meta-llama/Llama-2-13b-hf`) in the following training script; the model weights will then be downloaded and loaded automatically.
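When a repository name is given, the weights are resolved through the standard Hugging Face `transformers` loading path. The sketch below only illustrates that behavior and is not the repository's own loader (which `run.sh` invokes):

```python
# Minimal sketch of resolving --model_name_or_path via Hugging Face (illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name_or_path = "meta-llama/Llama-2-13b-hf"  # or a local path such as ./backbone_model/Llama-2-13b-hf

# Note: the official Llama-2 repositories are gated; accept the license and log in
# with `huggingface-cli login` before downloading.
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path)  # downloads and caches if not local
```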
The training implementation was inspired by LLaMA Factory; you can check their technical report here. We use the SFT and DPO modules during training. You can also use the PPO algorithm by following the instructions of LLaMA Factory with the exact same dataset as DPO.
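For the preference-learning stage, each record pairs a prompt with a preferred and a rejected response. The hypothetical sketch below shows the general shape of such a record; the field names follow the common `prompt`/`chosen`/`rejected` convention and are not taken from this repository:

```python
# Hypothetical DPO preference record (field names and values are illustrative):
# the "chosen" knowledge led to a correct execution result on the database,
# while the "rejected" knowledge did not.
preference_example = {
    "prompt": "Question: How many orders were placed in 2023?\nSchema: orders(order_id, order_date, ...)",
    "chosen": "orders placed in 2023 refers to strftime('%Y', order_date) = '2023'",
    "rejected": "orders placed in 2023 refers to order_date = 2023",
}
```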
We provide a quick-start script for the BIRD dataset, which supervised fine-tunes DELLM on the annotated expert knowledge from the BIRD training set (Li et al., 2023) and further trains DELLM via preference learning with database feedback.
```shell
cd ./model && sh run.sh
```
If you run the script above, the generated expert knowledge for the BIRD dev set will be saved in `./model/out` (you can modify the path in the script to obtain generated knowledge for the Spider dev set). This knowledge can be used to assist SQL generation. For further evaluation, we follow the official evaluation protocols proposed by BIRD and Spider for SQL generation and execution verification. The corresponding evaluation scripts can be obtained from their official repositories, and you can evaluate our Knowledge-to-SQL framework by replacing the official evidence with the generated knowledge.
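A minimal sketch of that replacement step is shown below; the output file name under `./model/out` and the exact format of the generated knowledge are assumptions, so adapt them to your actual outputs:

```python
# Swap the official BIRD "evidence" for DELLM-generated knowledge before running
# the official evaluation scripts (illustrative; paths and formats are assumptions).
import json

with open("./dataset/bird/dev/dev.json") as f:            # BIRD dev questions
    dev = json.load(f)

with open("./model/out/generated_knowledge.json") as f:   # hypothetical output file name
    generated = json.load(f)                               # assumed: one knowledge string per dev question

for example, knowledge in zip(dev, generated):
    example["evidence"] = knowledge

with open("./dataset/bird/dev/dev_with_dellm_knowledge.json", "w") as f:
    json.dump(dev, f, indent=2)
```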
Please cite our paper if you include Knowledge-to-SQL in your work:
```bibtex
@inproceedings{hong2024knowledge,
    title = "Knowledge-to-{SQL}: Enhancing {SQL} Generation with Data Expert {LLM}",
    author = "Hong, Zijin and Yuan, Zheng and Chen, Hao and Zhang, Qinggang and Huang, Feiran and Huang, Xiao",
    booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
    year = "2024"
}
```

Feel free to reach out via email if you need any help:
zijin.hong@connect.polyu.hk