Improve NL2SQL with Natural Language Explanations as Self-provided Feedback
This is the official repository containing the code and pre-trained models for our paper *Grounding Natural Language to SQL Translation with Data-Based Self-Explanations*.
This code implements:
- A plug-and-play iterative framework built upon self-provided feedback to enhance the translation accuracy of existing end-to-end models.
TL;DR: We introduce CycleSQL -- a plug-and-play framework that enables flexible integration into existing end-to-end NL2SQL models. Inspired by the feedback mechanisms used in modern recommendation systems and the iterative refinement methods introduced in LLMs, CycleSQL introduces data-grounded NL explanations of query results as a form of internal feedback to create a self-contained feedback loop within the end-to-end translation process, facilitating iterative self-evaluation of translation correctness.
The objective of NL2SQL translation is to convert a natural language query into an SQL query.
While significant advances have been made in overall translation accuracy, current end-to-end models still struggle to produce output of the desired quality on their first attempt, because they treat language translation as a "one-time deal".
To tackle this problem, CycleSQL introduces natural language explanations of query results as self-provided feedback and uses this feedback to iteratively validate the correctness of the translation, thereby improving overall translation accuracy; this is the core approach of CycleSQL.
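For instance, the task is to map a question to an executable SQL query over a target database. The pair below is an illustrative example in the style of the Spider benchmark, not taken from this repository:

```python
# Illustrative NL2SQL input/output pair (hypothetical, in the style of Spider).
example = {
    "db_id": "concert_singer",                   # target database
    "question": "How many singers do we have?",  # natural language query
    "sql": "SELECT count(*) FROM singer",        # desired SQL translation
}
```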
CycleSQL uses the following four steps to establish the feedback loop for the NL2SQL translation process:
- Provenance Tracking: Track the provenance of the query result to be explained, retrieving data-level information from the database.
- Semantics Enrichment: Enhance the provenance by associating it with operation-level semantics derived from the translated SQL.
- Explanation Generation: Generate a natural language explanation by interpreting the enriched provenance information.
- Translation Verification: The generated NL explanation is utilized to verify the correctness of the underlying NL2SQL translation. CycleSQL iterates through the above steps until a validated correct translation is achieved.
This process is illustrated in the diagram below:
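The sketch below shows one way these four steps could be wired into a feedback loop over an existing model's beam outputs. All names and signatures are illustrative placeholders, not the actual interfaces defined under src/.

```python
# Illustrative sketch of the CycleSQL feedback loop; every name is a placeholder
# for the components implemented under src/, not their real API.
from typing import Callable, Iterable, Optional, Tuple


def cyclesql_loop(
    question: str,
    beam_sqls: Iterable[str],
    track_provenance: Callable[[str], object],          # step 1: query-result provenance
    enrich_semantics: Callable[[object, str], object],  # step 2: attach SQL operation semantics
    explain: Callable[[object], str],                   # step 3: provenance -> NL explanation
    verify: Callable[[str, str], bool],                 # step 4: NLI check of (question, explanation)
) -> Tuple[Optional[str], Optional[str]]:
    """Iterate over candidate translations until one is verified as correct."""
    candidates = list(beam_sqls)
    for sql in candidates:
        provenance = track_provenance(sql)
        enriched = enrich_semantics(provenance, sql)
        explanation = explain(enriched)
        if verify(question, explanation):
            return sql, explanation  # verified translation found
    # Fall back to the top-1 candidate if no translation is verified.
    return (candidates[0] if candidates else None), None
```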
First, you should set up a Python environment. This code base has been tested under Python 3.8.
- Install the required packages
pip install -r requirements.txt
- Download the Spider dataset and the three robustness variants (Spider-Realistic, Spider-Syn, and Spider-DK), and put the data into the data folder. Unpack the datasets and create the following directory structure:
/data
├── spider
│   ├── database
│   │   └── ...
│   ├── dev.json
│   ├── dev_gold.sql
│   ├── tables.json
│   ├── train_gold.sql
│   └── train.json
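As a quick, optional sanity check (an illustrative snippet, not part of the repository), you can verify that the expected Spider files are in place:

```python
# Illustrative sanity check that the Spider data was unpacked into data/spider.
from pathlib import Path

spider = Path("data/spider")
expected = ["database", "dev.json", "dev_gold.sql", "tables.json", "train_gold.sql", "train.json"]
missing = [name for name in expected if not (spider / name).exists()]
print("Spider data looks complete." if not missing else f"Missing entries: {missing}")
```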
Try running CycleSQL (with the RESDSQL model) using the corresponding beam outputs on the Spider dev set:
$ bash run_infer.sh spider_dev resdsql data/spider/dev.json beam_outputs/raw/spider/resdsql.dev.beam8.txt data/spider/tables.json data/spider/database data/spider/ts_database
After running, the directory outputs will be generated in the current directory with the following contents:
|-- spider                  # dataset name
    |-- resdsql             # base model name
        |-- pred.txt        # top-1 SQL outputs from CycleSQL
        |-- eval_result.txt # evaluation results (using the Spider evaluation script)
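For example, you can peek at the first few top-1 predictions (an illustrative snippet; the path follows the layout above):

```python
# Illustrative: print the first three top-1 SQL predictions produced by CycleSQL.
from pathlib import Path

for sql in Path("outputs/spider/resdsql/pred.txt").read_text().splitlines()[:3]:
    print(sql)
```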
- Here is an overview of the code structure:
|-- src
    |-- annotator
    |   |-- annotate.py   # add semantic annotations over provenance information
    |-- translator
    |   |-- xql2nl.py     # translate provenance into natural language
    |-- explainer.py      # assemble all parts and build up the CycleSQL pipeline
    |-- util.py           # some utility functions
    |-- word_dict.py      # word dictionary for sql2nl translation
📃 Natural Language Inference Model: We implement the natural language inference model based on T5-large. We use various NL2SQL models (i.e., SmBoP, PICARD, RESDSQL, and ChatGPT) to generate the training data for this model. You can train the model from scratch with the following command:
$ python scripts/run_classification.py --model_name_or_path t5-large --shuffle_train_dataset --do_train --do_eval --num_train_epochs 5 --learning_rate 5e-6 --per_device_train_batch_size 8 --per_device_eval_batch_size 1 --evaluation_strategy steps --train_file data/nli/train.json --validation_file data/nli/dev.json --output_dir tmp/ --load_best_model_at_end --save_total_limit 5
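Once trained, the classifier can score a (question, explanation) pair. The snippet below is a minimal sketch assuming the checkpoint follows the standard Hugging Face sequence-classification format produced by scripts/run_classification.py; the checkpoint path and label interpretation are assumptions, not the repository's API.

```python
# Minimal sketch of scoring a (question, explanation) pair with the trained
# NLI classifier. Assumes a Hugging Face sequence-classification checkpoint;
# the checkpoint path and label meaning are assumptions.
from transformers import pipeline

classifier = pipeline("text-classification", model="saved_models/checkpoint-500")

question = "How many singers do we have?"
explanation = "The result 6 counts all rows in the singer table."

# The text-classification pipeline accepts a text/text_pair dict for sentence pairs.
result = classifier({"text": question, "text_pair": explanation})
print(result)  # e.g. a label/score pair; the label mapping depends on the training data
```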
The natural language inference model checkpoint will be uploaded at the following link:
Model | Download Link |
---|---|
nli-classifier | nli-classifier.tar.gz |
Just put the model checkpoint into the saved_models/checkpoint-500 folder.
The evaluation script run_infer.sh is located in the root directory. You can run it with:
$ bash run_infer.sh <dataset_name> <model_name> <test_file_path> <model_raw_beam_output_file_path> <table_path> <db_dir> <test_suite_db_dir>
This project welcomes contributions and suggestions 👍.
If you find bugs in our code, encounter problems when running the code, or have suggestions for CycleSQL, please submit an issue or reach out to me (kaimary1221@163.com)!