ICDE 2025 Paper, Grounding Natural Language to SQL Translation with Data-Based Self-Explanations

Kaimary/CycleSQL


Improve NL2SQL with Natural Language Explanations as Self-provided Feedback

This official repository contains the code and pre-trained models for our paper *Grounding Natural Language to SQL Translation with Data-Based Self-Explanations*.


📖 Overview

This code implements:

  • A plug-and-play iterative framework built upon self-provided feedback to enhance the translation accuracy of existing end-to-end models.

🚀 About CycleSQL

TL;DR: We introduce CycleSQL -- a plug-and-play framework that can be flexibly integrated into existing end-to-end NL2SQL models. Inspired by the feedback mechanisms used in modern recommendation systems and the iterative refinement methods introduced in LLMs, CycleSQL uses data-grounded NL explanations of query results as a form of internal feedback, creating a self-contained feedback loop within the end-to-end translation process that facilitates iterative self-evaluation of translation correctness.

The objective of NL2SQL translation is to convert a natural language query into an SQL query.
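As a concrete illustration (a toy schema and question, not drawn from the paper's datasets), the pair below shows an NL question, its SQL translation, and the translated query executed against an in-memory SQLite database:

```python
import sqlite3

# Hypothetical NL2SQL pair (toy schema, not from the Spider benchmark).
# The NL question is the model input; the SQL query is the target output.
nl_question = "How many singers are older than 30?"
sql_query = "SELECT COUNT(*) FROM singer WHERE age > 30"

# A tiny in-memory database to execute the translated SQL against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?)",
    [("Alice", 25), ("Bob", 34), ("Carol", 41)],
)

result = conn.execute(sql_query).fetchone()[0]
print(result)  # 2
```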

Despite significant advancements in overall translation accuracy, current end-to-end models still struggle to produce output of the desired quality on their first attempt, because they treat language translation as a "one-time deal".

To tackle this problem, CycleSQL introduces natural language explanations of query results as self-provided feedback, and uses this feedback to iteratively validate the correctness of the translation, thereby improving overall translation accuracy.

❓ How it works

CycleSQL uses the following four steps to establish the feedback loop for the NL2SQL translation process:

  1. Provenance Tracking: Track the provenance of the to-be-explained query result to retrieve data-level information from the database.
  2. Semantics Enrichment: Enhance the provenance by associating it with operation-level semantics derived from the translated SQL.
  3. Explanation Generation: Generate a natural language explanation by interpreting the enriched provenance information.
  4. Translation Verification: Use the generated NL explanation to verify the correctness of the underlying NL2SQL translation. The steps above iterate until a translation is validated as correct.

This process is illustrated in the diagram below:
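The four-step loop can be sketched as follows. This is a minimal, hypothetical illustration: the helper functions are toy stand-ins for the four steps (the real implementations live under `src/` in this repository, and the verifier in the paper is a T5-based NLI classifier):

```python
# Hypothetical sketch of the CycleSQL feedback loop over beam candidates.

def track_provenance(sql, db):
    # Step 1: execute the SQL and keep the rows that produced the result.
    return db.get(sql, [])

def enrich_semantics(provenance, sql):
    # Step 2: attach operation-level semantics from the SQL to the provenance.
    return {"rows": provenance, "sql": sql}

def generate_explanation(enriched):
    # Step 3: verbalize the enriched provenance as an NL explanation.
    return f"The query {enriched['sql']!r} returned {len(enriched['rows'])} row(s)."

def verify_translation(nl_query, explanation):
    # Step 4: toy verifier -- the paper uses a trained NLI model here.
    return "returned 1 row" in explanation

def cyclesql_loop(nl_query, beam, db):
    """Iterate over beam candidates until one is verified as correct."""
    for sql in beam:
        provenance = track_provenance(sql, db)
        explanation = generate_explanation(enrich_semantics(provenance, sql))
        if verify_translation(nl_query, explanation):
            return sql
    return beam[0]  # no candidate verified: fall back to top-1

# Toy "database": maps SQL text to its result rows.
db = {"SELECT name FROM t WHERE id = 1": [("Alice",)],
      "SELECT name FROM t": [("Alice",), ("Bob",)]}
beam = ["SELECT name FROM t", "SELECT name FROM t WHERE id = 1"]
picked = cyclesql_loop("What is the name of entry 1?", beam, db)
print(picked)  # the second candidate is verified and returned
```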

⚡️ Quick Start

🙇 Prerequisites

First, set up a Python environment. This codebase has been tested with Python 3.8.

  1. Install the required packages:

     pip install -r requirements.txt

  2. Download the Spider dataset and its three robustness variants (Spider-Realistic, Spider-Syn, and Spider-DK), and put the data into the data folder. Unpack the datasets to create the following directory structure:

```
/data
├── spider
│   ├── database
│   │   └── ...
│   ├── dev.json
│   ├── dev_gold.sql
│   ├── tables.json
│   ├── train_gold.sql
│   └── train.json
```

🌪️ Try it

Try running CycleSQL (with the RESDSQL model) using the corresponding beam outputs on the Spider dev set:

```shell
$ bash run_infer.sh spider_dev resdsql data/spider/dev.json beam_outputs/raw/spider/resdsql.dev.beam8.txt data/spider/tables.json data/spider/database data/spider/ts_database
```

After running, an outputs directory will be created in the current directory with the following contents:

```
|-- spider                   # dataset name
    |-- resdsql              # base model name
        |-- pred.txt         # top-1 SQL outputs from CycleSQL
        |-- eval_result.txt  # evaluation results (uses the Spider evaluation script)
```

📚 Code Structure

Here is an overview of the code structure:

```
|-- src
    |-- annotator
    |   |-- annotate.py   # add semantic annotations over provenance information
    |-- translator
    |   |-- xql2nl.py     # translate provenance into natural language
    |-- explainer.py      # assemble all parts and build up the CycleSQL pipeline
    |-- util.py           # utility functions
    |-- word_dict.py      # word dictionary for sql2nl translation
```

🏋️‍♀️ Training

📃 Natural Language Inference Model: We implemented the natural language inference model based on T5-large, using various NL2SQL models (i.e., SmBoP, PICARD, RESDSQL, and ChatGPT) to generate its training data. You can train the model from scratch with the following command:

```shell
$ python scripts/run_classification.py --model_name_or_path t5-large --shuffle_train_dataset --do_train --do_eval --num_train_epochs 5 --learning_rate 5e-6 --per_device_train_batch_size 8 --per_device_eval_batch_size 1 --evaluation_strategy steps --train_file data/nli/train.json  --validation_file data/nli/dev.json --output_dir tmp/ --load_best_model_at_end --save_total_limit 5
```
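Each training instance for this verifier pairs a generated explanation with the original NL query and a correctness label. The sketch below shows one hypothetical way to construct such an instance; the field names and the exact JSON schema used in data/nli/train.json are assumptions, not the repository's actual format:

```python
import json

# Hypothetical NLI training instance: does the explanation (premise)
# entail the original NL query (hypothesis)? Field names are illustrative.
def make_nli_instance(nl_query, explanation, correct):
    return {
        "text": f"premise: {explanation} hypothesis: {nl_query}",
        "label": 1 if correct else 0,
    }

instance = make_nli_instance(
    "How many singers are older than 30?",
    "The query counted the 2 singers whose age is greater than 30.",
    correct=True,
)
print(json.dumps(instance))
```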

👍 Download the checkpoint

The natural language inference model checkpoint is available at the following link:

| Model          | Download Link          |
| -------------- | ---------------------- |
| nli-classifier | nli-classifier.tar.gz  |

Just put the model checkpoint into the saved_models/checkpoint-500 folder.

👀 Inference

The evaluation script run_infer.sh is located in the root directory. You can run it with:

```shell
$ bash run_infer.sh <dataset_name> <model_name> <test_file_path> <model_raw_beam_output_file_path> <table_path> <db_dir> <test_suite_db_dir>
```

🌈 Contributing

This project welcomes contributions and suggestions 👍.

If you find bugs in our code, encounter problems when running the code, or have suggestions for CycleSQL, please submit an issue or reach out to me (kaimary1221@163.com)!
