The data and the PyTorch implementation for the models and experiments in the paper "Edisum: Summarizing and Explaining Wikipedia Edits at Scale"
This repository contains the PyTorch implementation for the models and experiments in the paper "Edisum: Summarizing and Explaining Wikipedia Edits at Scale".
Please consider citing our work if you found the provided resources useful:

```bibtex
@article{šakota2024edisum,
  title={Edisum: Summarizing and Explaining Wikipedia Edits at Scale},
  author={Marija Šakota and Isaac Johnson and Guosheng Feng and Robert West},
  journal={arXiv preprint arXiv:2404.03428},
  year={2024}
}
```
Start by cloning the repository:
```bash
git clone https://github.com/epfl-dlab/edisum.git
```
We recommend creating a new conda virtual environment as follows:
```bash
conda env create -f environment.yml
```
This command also installs all the necessary packages.
The data is available on Hugging Face and can be loaded with:
```python
from datasets import load_dataset

dataset = load_dataset("msakota/edisum_dataset")
```

Alternatively, to download the collected data for the experiments, run:
```bash
bash ./download_data.sh
```
For downloading the trained models (available on Hugging Face), run:
```bash
bash ./download_models.sh
```
To train a model from scratch on the desired data, run:
```bash
DATA_DIR="./data/100_perc_synth_data/"  # specify a directory where training data is located
RUN_NAME="train_longt5_100_synth"
python run_train.py run_name=$RUN_NAME dir=$DATA_DIR +experiment=finetune_longt5
```
To run inference on a trained model:
```bash
DATA_DIR="./data/100_perc_synth_data/"  # specify a directory where training data is located
CHECKPOINT_PATH="./models/edisum_100.ckpt"  # specify path to the trained model
RUN_NAME="inference_longt5_100_synth"
python run_inference.py run_name=$RUN_NAME dir=$DATA_DIR checkpoint_path=$CHECKPOINT_PATH +experiment=inference_longt5
```
To test any of the trained models on an arbitrary edit diff link:
```bash
python run_model.py --model_name_or_path edisum_100 --diff_link "https://en.wikipedia.org/w/index.php?title=C/2023_A3_(Tsuchinshan–ATLAS)&diff=prev&oldid=1251441412"
```

Optionally, you can stop the generation in case there are any node changes (as the generated summary might not reflect the changes exhaustively) by adding `-prohibit_node`. If no `model_name_or_path` is provided, the script defaults to `edisum_100`. You can provide a path to any `.ckpt` model, or specify one of the five models from the paper: `[edisum_0, edisum_25, edisum_50, edisum_75, edisum_100]`, where the number represents the percentage of synthetic data in the training dataset.
To test any custom input, which might not necessarily be a real edit:
```bash
python run_model.py --model_name_or_path edisum_100 --input_text <your_input_text>
```
For optimal performance, the input text should be formatted in the same way as the training data:
- The edit diff is represented by collecting the sentences that were altered, added, or removed during the edit into two sets: *previous sentences* (belonging to the previous revision of the page) and *current sentences* (belonging to the current revision of the page)
- Previous sentences should contain each sentence that was removed from the previous revision, as well as the previous versions of the sentences that were altered
- Current sentences should contain each sentence that was added to the new revision, as well as the new versions of the sentences that were altered
- The input is then built by concatenating each sentence in previous sentences, separating them with `<sent_sep>` and adding the prefix `<old_text>`. Similarly, sentences in current sentences are separated with the same `<sent_sep>` and the prefix `<new_text>` is added. The final input is then derived by concatenating these two representations.
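The steps above can be sketched in Python as follows. This is a minimal illustration, not code from the repository: the function name `build_input`, the example sentences, and the exact whitespace handling between prefixes and sentences are all assumptions.

```python
# Illustrative sketch of the input format described above.
SENT_SEP = "<sent_sep>"

def build_input(previous_sentences, current_sentences):
    """Concatenate the two sentence sets into a single model input string.

    The joining with no extra whitespace is an assumption; check the
    repository's data files for the exact formatting.
    """
    old_part = "<old_text>" + SENT_SEP.join(previous_sentences)
    new_part = "<new_text>" + SENT_SEP.join(current_sentences)
    return old_part + new_part

# Hypothetical edit: one sentence altered, one sentence added.
prev = ["The comet was discovered in 2023."]
curr = ["The comet was discovered in January 2023.",
        "It reached perihelion in 2024."]
print(build_input(prev, curr))
```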
We also provide a Jupyter notebook for experimentation with custom inputs: `playground.ipynb`.
This project is licensed under the terms of the MIT license.