Improve NL2SQL with Natural Language Explanations as Self-provided Feedback
This is the official repository containing the code and pre-trained models for our paper *Grounding Natural Language to SQL Translation with Data-Based Self-Explanations*.
This code implements:
- A plug-and-play iterative framework built upon self-provided feedback to enhance the translation accuracy of existing end-to-end models.
TL;DR: We introduce CycleSQL -- a plug-and-play framework that enables flexible integration into existing end-to-end NL2SQL models. Inspired by the feedback mechanisms used in modern recommendation systems and the iterative refinement methods introduced in LLMs, CycleSQL introduces data-grounded NL explanations of query results as a form of internal feedback to create a self-contained feedback loop within the end-to-end translation process, facilitating iterative self-evaluation of translation correctness.
The objective of NL2SQL translation is to convert a natural language query into an SQL query.
While significant advances have been made in overall translation accuracy, current end-to-end models still struggle to produce output of the desired quality on their first attempt, because they treat language translation as a "one-time deal".
To tackle this problem, CycleSQL introduces natural language explanations of query results as self-provided feedback and uses this feedback to iteratively validate the correctness of the translation, thereby improving overall translation accuracy; this is the core approach of CycleSQL.
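For instance, the task is to map a question to an executable SQL query over a target database. The pair below is an illustrative example in the style of the Spider benchmark, not taken from this repository:

```python
# Illustrative NL2SQL input/output pair (hypothetical, in the style of Spider).
example = {
    "db_id": "concert_singer",                   # target database
    "question": "How many singers do we have?",  # natural language query
    "sql": "SELECT count(*) FROM singer",        # desired SQL translation
}
```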
CycleSQL uses the following four steps to establish the feedback loop for the NL2SQL translation process:
- Provenance Tracking: Track the provenance of the query result to be explained, retrieving data-level information from the database.
- Semantics Enrichment: Enhance the provenance by associating it with operation-level semantics derived from the translated SQL.
- Explanation Generation: Generate a natural language explanation by interpreting the enriched provenance information.
- Translation Verification: The generated NL explanation is utilized to verify the correctness of the underlying NL2SQL translation. CycleSQL iterates through the above steps until a validated correct translation is achieved.
This process is illustrated in the diagram below:
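The sketch below shows one way these four steps could be wired into a feedback loop over an existing model's beam outputs. All names and signatures are illustrative placeholders, not the actual interfaces defined under src/.

```python
# Illustrative sketch of the CycleSQL feedback loop; every name is a placeholder
# for the components implemented under src/, not their real API.
from typing import Callable, Iterable, Optional, Tuple


def cyclesql_loop(
    question: str,
    beam_sqls: Iterable[str],
    track_provenance: Callable[[str], object],          # step 1: query-result provenance
    enrich_semantics: Callable[[object, str], object],  # step 2: attach SQL operation semantics
    explain: Callable[[object], str],                   # step 3: provenance -> NL explanation
    verify: Callable[[str, str], bool],                 # step 4: NLI check of (question, explanation)
) -> Tuple[Optional[str], Optional[str]]:
    """Iterate over candidate translations until one is verified as correct."""
    candidates = list(beam_sqls)
    for sql in candidates:
        provenance = track_provenance(sql)
        enriched = enrich_semantics(provenance, sql)
        explanation = explain(enriched)
        if verify(question, explanation):
            return sql, explanation  # verified translation found
    # Fall back to the top-1 candidate if no translation is verified.
    return (candidates[0] if candidates else None), None
```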
First, you should set up a Python environment. This code base has been tested under Python 3.8.
- Install the required packages
pip install -r requirements.txt
- Download the Spider dataset and the three robustness variants (Spider-Realistic, Spider-Syn, and Spider-DK), and put the data into the data folder. Unpack the datasets and create the following directory structure:
/data
├── spider
│   ├── database
│   │   └── ...
│   ├── dev.json
│   ├── dev_gold.sql
│   ├── tables.json
│   ├── train_gold.sql
│   └── train.json
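As a quick, optional sanity check (an illustrative snippet, not part of the repository), you can verify that the expected Spider files are in place:

```python
# Illustrative sanity check that the Spider data was unpacked into data/spider.
from pathlib import Path

spider = Path("data/spider")
expected = ["database", "dev.json", "dev_gold.sql", "tables.json", "train_gold.sql", "train.json"]
missing = [name for name in expected if not (spider / name).exists()]
print("Spider data looks complete." if not missing else f"Missing entries: {missing}")
```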
Try running CycleSQL (with the RESDSQL model) using the corresponding beam outputs on the Spider dev set:
$ bash run_infer.sh spider_dev resdsql data/spider/dev.json beam_outputs/raw/spider/resdsql.dev.beam8.txt data/spider/tables.json data/spider/database data/spider/ts_database
After running, the directory outputs will be generated in the current directory with the following contents:
|-- spider                  # dataset name
    |-- resdsql             # base model name
        |-- pred.txt        # top-1 SQL outputs from CycleSQL
        |-- eval_result.txt # evaluation results (using the Spider evaluation script)
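For example, you can peek at the first few top-1 predictions (an illustrative snippet; the path follows the layout above):

```python
# Illustrative: print the first three top-1 SQL predictions produced by CycleSQL.
from pathlib import Path

for sql in Path("outputs/spider/resdsql/pred.txt").read_text().splitlines()[:3]:
    print(sql)
```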
- Here is an overview of the code structure:
|-- src
    |-- annotator
    |   |-- annotate.py   # add semantic annotations over provenance information
    |-- translator
    |   |-- xql2nl.py     # translate provenance into natural language
    |-- explainer.py      # assemble all parts and build up the CycleSQL pipeline
    |-- util.py           # some utility functions
    |-- word_dict.py      # word dictionary for sql2nl translation
📃 Natural Language Inference Model: We implement the natural language inference model based on T5-large. We use various NL2SQL models (i.e., SmBoP, PICARD, RESDSQL, and ChatGPT) to generate the training data for this model. You can train the model from scratch with the following command:
$ python scripts/run_classification.py --model_name_or_path t5-large --shuffle_train_dataset --do_train --do_eval --num_train_epochs 5 --learning_rate 5e-6 --per_device_train_batch_size 8 --per_device_eval_batch_size 1 --evaluation_strategy steps --train_file data/nli/train.json --validation_file data/nli/dev.json --output_dir tmp/ --load_best_model_at_end --save_total_limit 5
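Once trained, the classifier can score a (question, explanation) pair. The snippet below is a minimal sketch assuming the checkpoint follows the standard Hugging Face sequence-classification format produced by scripts/run_classification.py; the checkpoint path and label interpretation are assumptions, not the repository's API.

```python
# Minimal sketch of scoring a (question, explanation) pair with the trained
# NLI classifier. Assumes a Hugging Face sequence-classification checkpoint;
# the checkpoint path and label meaning are assumptions.
from transformers import pipeline

classifier = pipeline("text-classification", model="saved_models/checkpoint-500")

question = "How many singers do we have?"
explanation = "The result 6 counts all rows in the singer table."

# The text-classification pipeline accepts a text/text_pair dict for sentence pairs.
result = classifier({"text": question, "text_pair": explanation})
print(result)  # e.g. a label/score pair; the label mapping depends on the training data
```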
The natural language inference model checkpoint will be uploaded at the following link:
Model | Download Link |
---|---|
nli-classifier | nli-classifier.tar.gz |
Just put the model checkpoint into the saved_models/checkpoint-500 folder.
The evaluation script run_infer.sh is located in the root directory. You can run it with:
$ bash run_infer.sh <dataset_name> <model_name> <test_file_path> <model_raw_beam_output_file_path> <table_path> <db_dir> <test_suite_db_dir>
This project welcomes contributions and suggestions 👍.
If you find bugs in our code, encounter problems when running the code, or have suggestions for CycleSQL, please submit an issue or reach out to me (kaimary1221@163.com)!