The data and the PyTorch implementation for the models and experiments in the paper "Edisum: Summarizing and Explaining Wikipedia Edits at Scale"

epfl-dlab/edisum


This repository contains the PyTorch implementation for the models and experiments in the paper "Edisum: Summarizing and Explaining Wikipedia Edits at Scale"

@article{šakota2024edisum,
  title={Edisum: Summarizing and Explaining Wikipedia Edits at Scale},
  author={Marija Šakota and Isaac Johnson and Guosheng Feng and Robert West},
  journal={arXiv preprint arXiv:2404.03428},
  year={2024}
}

Please consider citing our work if you found the provided resources useful.

1. Setup

Start by cloning the repository:

git clone https://github.com/epfl-dlab/edisum.git

We recommend creating a new conda virtual environment as follows:

conda env create -f environment.yml

This command also installs all the necessary packages.

2. Downloading data and models

The data is available on huggingface and can be loaded with:

from datasets import load_dataset

dataset = load_dataset("msakota/edisum_dataset")

Alternatively, to download the collected data for the experiments, run:

bash ./download_data.sh

For downloading the trained models (available on huggingface), run:

bash ./download_models.sh

3. Usage

Training

To train a model from scratch on the desired data, run:

DATA_DIR="./data/100_perc_synth_data/"  # specify a directory where training data is located
RUN_NAME="train_longt5_100_synth"
python run_train.py run_name=$RUN_NAME dir=$DATA_DIR +experiment=finetune_longt5

Inference

To run inference on a trained model:

DATA_DIR="./data/100_perc_synth_data/"  # specify a directory where training data is located
CHECKPOINT_PATH="./models/edisum_100.ckpt"  # specify path to the trained model
RUN_NAME="inference_longt5_100_synth"
python run_inference.py run_name=$RUN_NAME dir=$DATA_DIR checkpoint_path=$CHECKPOINT_PATH +experiment=inference_longt5

4. Experimenting with custom inputs

By providing an edit diff link

To test any of the trained models on an arbitrary edit diff link:

python run_model.py --model_name_or_path edisum_100 --diff_link "https://en.wikipedia.org/w/index.php?title=C/2023_A3_(Tsuchinshan–ATLAS)&diff=prev&oldid=1251441412"

Optionally, you can stop the generation in case there are any node changes (as the generated summary might not reflect the changes exhaustively) by adding -prohibit_node. If no model_name_or_path is provided, the script defaults to edisum_100. You can provide a path to any .ckpt model, or specify one of the five models from the paper: [edisum_0, edisum_25, edisum_50, edisum_75, edisum_100], where the number represents the percentage of synthetic data in the training dataset.

By providing a custom input

To test any custom input, which might not necessarily be a real edit:

python run_model.py --model_name_or_path edisum_100 --input_text <your_input_text>

For optimal performance, the input text should be formatted the same way the training data was formatted:

  1. The edit diff should be represented by collecting the sentences that were altered, added, or removed during the edit into two sets: previous sentences (belonging to the previous revision of the page) and current sentences (belonging to the current revision of the page)
  2. Previous sentences should contain each sentence that was removed from the previous revision, and the versions of the altered sentences as they appeared in the previous revision
  3. New sentences should contain each sentence that was added to the new revision, and the versions of the altered sentences as they appear in the new revision
  4. The input is then made by concatenating each sentence in previous sentences, separating them with <sent_sep>, and adding the prefix <old_text>. Similarly, sentences in current sentences are separated with the same <sent_sep>, and the prefix <new_text> is added. The final input is then derived by concatenating these two representations.
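The steps above can be sketched as a small helper. Note this is our own illustration, not a function from the repository, and the exact whitespace around the special tokens is an assumption; only the tokens <old_text>, <new_text>, and <sent_sep> come from the description above.

```python
def build_edit_input(previous_sentences, current_sentences):
    """Assemble a model input from the previous/current sentence sets.

    Hypothetical helper: joins each set with <sent_sep>, prefixes them
    with <old_text> / <new_text>, and concatenates the two parts.
    """
    old_part = "<old_text> " + " <sent_sep> ".join(previous_sentences)
    new_part = "<new_text> " + " <sent_sep> ".join(current_sentences)
    return old_part + " " + new_part


# Example edit: one sentence altered, one sentence added.
text = build_edit_input(
    ["The tower is 300 m tall."],
    ["The tower is 330 m tall.", "It was repainted in 2023."],
)
print(text)
```

The resulting string can then be passed to run_model.py via --input_text.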


Jupyter notebook

We also provide a Jupyter notebook for experimentation with custom inputs: playground.ipynb

License

This project is licensed under the terms of the MIT license.
