The data and the PyTorch implementation for the models and experiments in the paper "Edisum: Summarizing and Explaining Wikipedia Edits at Scale"

epfl-dlab/edisum


This repository contains the PyTorch implementation for the models and experiments in the paper "Edisum: Summarizing and Explaining Wikipedia Edits at Scale"

@article{šakota2024edisum,
  title={Edisum: Summarizing and Explaining Wikipedia Edits at Scale},
  author={Marija Šakota and Isaac Johnson and Guosheng Feng and Robert West},
  journal={arXiv preprint arXiv:2404.03428},
  year={2024}
}

Please consider citing our work if you found the provided resources useful.

1. Setup

Start by cloning the repository:

git clone https://github.com/epfl-dlab/edisum.git

We recommend creating a new conda virtual environment as follows:

conda env create -f environment.yml

This command also installs all the necessary packages.

2. Downloading data and models

The data is available on huggingface and can be loaded with:

from datasets import load_dataset

dataset = load_dataset("msakota/edisum_dataset")

Alternatively, to download the collected data for the experiments, run:

bash ./download_data.sh

For downloading the trained models (available on huggingface), run:

bash ./download_models.sh

3. Usage

Training

To train a model from scratch on the desired data, run:

DATA_DIR="./data/100_perc_synth_data/"  # specify a directory where training data is located
RUN_NAME="train_longt5_100_synth"
python run_train.py run_name=$RUN_NAME dir=$DATA_DIR +experiment=finetune_longt5

Inference

To run inference on a trained model:

DATA_DIR="./data/100_perc_synth_data/"  # specify a directory where training data is located
CHECKPOINT_PATH="./models/edisum_100.ckpt"  # specify path to the trained model
RUN_NAME="inference_longt5_100_synth"
python run_inference.py run_name=$RUN_NAME dir=$DATA_DIR checkpoint_path=$CHECKPOINT_PATH +experiment=inference_longt5

4. Experimenting with custom inputs

By providing an edit diff link

To test any of the trained models on an arbitrary edit diff link:

python run_model.py --model_name_or_path edisum_100 --diff_link "https://en.wikipedia.org/w/index.php?title=C/2023_A3_(Tsuchinshan–ATLAS)&diff=prev&oldid=1251441412"

Optionally, you can stop the generation in case there are any node changes (as the generated summary might not reflect the changes exhaustively) by adding -prohibit_node. If no model_name_or_path is provided, the script defaults to edisum_100. You can provide a path to any .ckpt model, or specify one of the five models from the paper: [edisum_0, edisum_25, edisum_50, edisum_75, edisum_100], where the number represents the percentage of synthetic data in the training dataset.

By providing a custom input

To test any custom input, which might not necessarily be a real edit:

python run_model.py --model_name_or_path edisum_100 --input_text <your_input_text>

For optimal performance, the input text should be formatted the same way the training data was formatted:

  1. The edit diff should be represented by collecting the sentences that were altered, added, or removed during the edit into two sets: previous sentences (belonging to the previous revision of the page) and current sentences (belonging to the current revision of the page)
  2. Previous sentences should contain each sentence that was removed from the previous revision, and the versions of the altered sentences as they appeared in the previous revision
  3. New sentences should contain each sentence that was added to the new revision, and the versions of the altered sentences as they appear in the new revision
  4. The input is then made by concatenating each sentence in previous sentences, separating them with <sent_sep>, and adding the prefix <old_text>. Similarly, sentences in current sentences are separated with the same <sent_sep>, and the prefix <new_text> is added. The final input is then derived by concatenating these two representations.
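The steps above can be sketched as a small helper. Note this is our own illustration, not a function from the repository, and the exact whitespace around the special tokens is an assumption; only the tokens <old_text>, <new_text>, and <sent_sep> come from the description above.

```python
def build_edit_input(previous_sentences, current_sentences):
    """Assemble a model input from the previous/current sentence sets.

    Hypothetical helper: joins each set with <sent_sep>, prefixes them
    with <old_text> / <new_text>, and concatenates the two parts.
    """
    old_part = "<old_text> " + " <sent_sep> ".join(previous_sentences)
    new_part = "<new_text> " + " <sent_sep> ".join(current_sentences)
    return old_part + " " + new_part


# Example edit: one sentence altered, one sentence added.
text = build_edit_input(
    ["The tower is 300 m tall."],
    ["The tower is 330 m tall.", "It was repainted in 2023."],
)
print(text)
```

The resulting string can then be passed to run_model.py via --input_text.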


Jupyter notebook

We also provide a Jupyter notebook for experimentation with custom inputs: playground.ipynb

License

This project is licensed under the terms of the MIT license.
