Language Modeling via Stochastic Processes
[Paper] [Open Review] [Long Video]
ICLR Oral 2022
Rose E Wang, Esin Durmus, Noah Goodman, Tatsunori Hashimoto
Abstract: Modern language models can generate high-quality short texts. However, they often meander or are incoherent when generating longer texts. These issues arise from the next-token-only language modeling objective. Recent work in self-supervised learning suggests that models can learn good latent representations via contrastive learning, which can be effective for discriminative tasks. Our work analyzes the application of contrastive representations for generative tasks, like long text generation. We propose one approach for leveraging contrastive representations, which we call Time Control (TC). TC first learns a contrastive representation of the target text domain, then generates text by decoding from these representations. Compared to domain-specific methods and fine-tuning GPT2 across a variety of text domains, TC performs competitively with methods designed specifically for learning sentence representations on discourse coherence. In long text generation settings, TC preserves the text structure both in terms of ordering (up to +15% better) and text length consistency (up to +90% better).
Setup:
- Follow the commands in `setup.sh`.
- Make sure you are in the virtual environment: `conda activate language_modeling_via_stochastic_processes`.
- Install the decoder's version of the transformers library: `cd decoder` (enter the decoder repo), then `pip install -e .` (this installs transformers locally; I modified their GPT2 module to take in our learned embeddings for decoding).
- Make sure you have a wandb account!
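Taken together, a minimal setup session might look like the sketch below. This assumes `setup.sh` can be executed directly from the repository root (rather than stepped through by hand) and that it creates the conda environment named above.

```bash
# From the repository root: install dependencies and create the environment.
bash setup.sh

# Activate the project's conda environment.
conda activate language_modeling_via_stochastic_processes

# Install the modified transformers library that ships with the decoder.
cd decoder
pip install -e .
cd ..

# Log in to Weights & Biases so runs can be tracked.
wandb login
```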
Datasets:
This repo contains all but two datasets (Wikihow and Recipe NLG); download instructions for those two are below. The other four datasets are already included in this repo.
The Wikihow dataset needs to be downloaded from this link. It's a pkl file that should go under `path/2/repo/data/wikihow/wiki_how_data.pkl`, where `path/2/repo` is your local path to this repository.
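As a sketch of the expected placement, assuming the pkl file was downloaded into your current working directory and `path/2/repo` stands for your local clone of this repository:

```bash
# Create the expected data directory and move the downloaded pickle into it.
mkdir -p path/2/repo/data/wikihow
mv wiki_how_data.pkl path/2/repo/data/wikihow/wiki_how_data.pkl

# Sanity-check that the file is where the code expects it.
ls -lh path/2/repo/data/wikihow/wiki_how_data.pkl
```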
The Wikisection dataset used in this paper is already included.
It came from this prior work -- specifically, we used the English city Wikipedia articles.
The Recipe NLG dataset needs to be downloaded separately. Download the Recipe NLG dataset and put the data under `encoder/data/recipe_nlg`.
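For example, assuming the Recipe NLG download has been unpacked into a local folder (the `recipe_nlg_download/` name below is just a placeholder):

```bash
# Create the expected directory inside the encoder and copy the dataset files in.
mkdir -p encoder/data/recipe_nlg
cp -r recipe_nlg_download/* encoder/data/recipe_nlg/
```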
The TM2 dataset used in this paper is already included. It came from the TM2 Restaurant Search dataset.
The TicketTalk dataset used in this paper is already included.
It can be found as the TicketTalk dataset (all the json files).
Training the encoders:
Before running experiments, run `cd encoder/code; source init_env.sh`.
In `encoder/code/scripts/run_ou.py`, set the variable `ckpt_dir` to your checkpoint directory.
The script for training the encoders (TC, VAE, Brownian, InfoNCE) can be found at `encoder/code/scripts/train_encoders.sh`.
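A minimal sketch of that workflow, assuming `ckpt_dir` has already been set in `run_ou.py` and that `train_encoders.sh` needs no extra command-line arguments:

```bash
# Set up the environment for the encoder experiments.
cd encoder/code
source init_env.sh

# Train the encoders (TC, VAE, Brownian, InfoNCE).
bash scripts/train_encoders.sh
```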
Discourse coherence experiments:
Before running experiments, run `cd encoder/code; source init_env.sh`.
In `encoder/code/scripts/run_discourse.py` and `encoder/code/src/systems/discourse_system.py`, set the correct paths to your data directory and repo.
The script for running the discourse coherence experiments can be found at `encoder/code/scripts/discourse.sh`.
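Assuming the paths in `run_discourse.py` and `discourse_system.py` have been updated, a run might look like:

```bash
# Environment setup for the encoder code, then launch the discourse
# coherence experiments via the provided script.
cd encoder/code
source init_env.sh
bash scripts/discourse.sh
```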
Training the decoder:
For training the decoder, you'll need to be in the directory `decoder/examples/pytorch/language-modeling/`.
The script for training the decoder can be found at `decoder/examples/pytorch/language-modeling/train_encoders.sh`. Make sure to change the `path2repo` variable.
You'll also need to change the directories to point to your data directory, as appropriate, in `run_time_clm.py`.
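A sketch of the decoder training step, assuming `path2repo` has been set inside the script and the data directories in `run_time_clm.py` point to your data:

```bash
# Train the decoder (the modified GPT2 that takes in the learned embeddings).
cd decoder/examples/pytorch/language-modeling/
bash train_encoders.sh
```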
Generating text:
For generating texts, you'll need to be in the directory `decoder/transformers/examples/pytorch/text-generation/`.
The script for generating text and measuring per-section length mismatches can be found at `decoder/transformers/examples/pytorch/text-generation/toy_wikisection_generation.sh`.
The script for generating long texts can be found at `decoder/transformers/examples/pytorch/text-generation/long_generation.sh`.
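Assuming the trained decoder checkpoints are picked up from the paths configured in the previous step, generation might be launched as:

```bash
# Run generation from the text-generation examples directory.
cd decoder/transformers/examples/pytorch/text-generation/

# Generate text and measure per-section length mismatches (Wikisection setting).
bash toy_wikisection_generation.sh

# Generate long texts.
bash long_generation.sh
```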
Running analyses:
To collect all the metrics, check out `analysis/run_analysis.sh`. You can run all the evaluations with `source analysis/run_analysis.sh`.
Remember to change the wandb username and project name to match what you listed in the encoder and decoder experiments.
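For example, from the repository root:

```bash
# Run all evaluations and collect metrics; assumes the wandb username and
# project name inside the script have already been updated.
source analysis/run_analysis.sh
```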