Basic Utilities for PyTorch Natural Language Processing (NLP)
With the PyTorch toolchain maturing, it's time to archive repos like this one. You'll be able to find more developed options for every part of this toolkit:
- Hugging Face Datasets (Datasets)
- Hugging Face Tokenizers (Encoders)
- Hugging Face Metrics (Metrics)
- PyTorch Datapipes (Download & Samplers)
- Hugging Face Embeddings (Word Vectors)
- PyTorch NN (NN)
- PyTorch TorchText (All-In-One)
Happy developing! ✨
If anyone wants to unarchive this repo and continue developing it, feel free to contact me. You can reach me at "petrochukm [at] gmail.com".
PyTorch-NLP, or `torchnlp` for short, is a library of basic utilities for PyTorch NLP. `torchnlp` extends PyTorch to provide you with basic text data processing functions.

Logo by Chloe Yeo, Corporate Sponsorship by WellSaid Labs
Make sure you have Python 3.6+ and PyTorch 1.0+. You can then install `pytorch-nlp` using `pip`:

```
pip install pytorch-nlp
```
Or to install the latest code via:
```
pip install git+https://github.com/PetrochukM/PyTorch-NLP.git
```
The complete documentation for PyTorch-NLP is available via our ReadTheDocs website.
Within an NLP data pipeline, you'll want to implement these basic steps: load your data, encode your text as tensors, and batch your tensors.
Load the IMDB dataset, for example:
```python
from torchnlp.datasets import imdb_dataset

# Load the imdb training dataset
train = imdb_dataset(train=True)
train[0]  # RETURNS: {'text': 'For a movie that gets..', 'sentiment': 'pos'}
```
Load a custom dataset, for example:
```python
from pathlib import Path

from torchnlp.download import download_file_maybe_extract

directory_path = Path('data/')
train_file_path = Path('trees/train.txt')

download_file_maybe_extract(
    url='http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip',
    directory=directory_path,
    check_files=[train_file_path])

open(directory_path / train_file_path)
```
Don't worry, we'll handle caching for you!
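To make the caching concrete, here is a small sketch (reusing the paths above; this exact snippet is not in the original README): a second call to `download_file_maybe_extract` finds the files listed in `check_files` on disk and skips the download, so you can read the split straight into memory.

```python
from pathlib import Path

from torchnlp.download import download_file_maybe_extract

directory_path = Path('data/')
train_file_path = Path('trees/train.txt')

# Second call: the `check_files` already exist, so nothing is downloaded again.
download_file_maybe_extract(
    url='http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip',
    directory=directory_path,
    check_files=[train_file_path])

# Read the PTB-style trees, one example per line.
with open(directory_path / train_file_path) as file_:
    train = [line.strip() for line in file_]
```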
Tokenize and encode your text as a tensor.
For example, a `WhitespaceEncoder` breaks text into tokens whenever it encounters a whitespace character.
```python
from torchnlp.encoders.text import WhitespaceEncoder

loaded_data = ["now this ain't funny", "so don't you dare laugh"]
encoder = WhitespaceEncoder(loaded_data)
encoded_data = [encoder.encode(example) for example in loaded_data]
```
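As a quick sanity check (not part of the original README), you can inspect the token indices and round-trip an example through the encoder; this sketch assumes the encoder's `vocab`, `encode`, and `decode` members behave as documented:

```python
from torchnlp.encoders.text import WhitespaceEncoder

loaded_data = ["now this ain't funny", "so don't you dare laugh"]
encoder = WhitespaceEncoder(loaded_data)

# `encode` returns a torch.Tensor of token indices.
encoded = encoder.encode(loaded_data[0])
print(encoded)

# The vocabulary is built from `loaded_data` plus special tokens (e.g. padding/unknown).
print(encoder.vocab)

# `decode` maps indices back to whitespace-joined tokens.
print(encoder.decode(encoded))  # Expected: "now this ain't funny"
```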
With your loaded and encoded data in hand, you'll want to batch your dataset.
```python
import torch

from torchnlp.samplers import BucketBatchSampler
from torchnlp.utils import collate_tensors
from torchnlp.encoders.text import stack_and_pad_tensors

encoded_data = [torch.randn(2), torch.randn(3), torch.randn(4), torch.randn(5)]

train_sampler = torch.utils.data.sampler.SequentialSampler(encoded_data)
train_batch_sampler = BucketBatchSampler(
    train_sampler, batch_size=2, drop_last=False, sort_key=lambda i: encoded_data[i].shape[0])

batches = [[encoded_data[i] for i in batch] for batch in train_batch_sampler]
batches = [collate_tensors(batch, stack_tensors=stack_and_pad_tensors) for batch in batches]
```
PyTorch-NLP builds on top of PyTorch's existing `torch.utils.data.sampler`, `torch.stack` and `default_collate` to support sequential inputs of varying lengths!
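If it helps to see what "padding to a common length" produces, here is a minimal illustration using plain PyTorch's `pad_sequence` (not part of PyTorch-NLP, shown only to visualize a padded batch and its lengths):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

sequences = [torch.randn(2), torch.randn(3), torch.randn(5)]

# Shorter sequences are padded with zeros so the batch forms one rectangular tensor.
padded = pad_sequence(sequences, batch_first=True)
lengths = torch.tensor([len(s) for s in sequences])

print(padded.shape)  # torch.Size([3, 5])
print(lengths)       # tensor([2, 3, 5])
```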
With your batch in hand, you can use PyTorch to develop and train your model using gradient descent. For example, check out this example code for training on the Stanford Natural Language Inference (SNLI) Corpus.
PyTorch-NLP has a couple more NLP-focused utility packages to support you! 🤗
Now that you've set up your pipeline, you may want to ensure that some functions run deterministically. Wrap any code that's random with `fork_rng` and you'll be good to go, like so:
```python
import random

import numpy
import torch

from torchnlp.random import fork_rng

with fork_rng(seed=123):  # Ensure determinism
    print('Random:', random.randint(1, 2**31))
    print('Numpy:', numpy.random.randint(1, 2**31))
    print('Torch:', int(torch.randint(1, 2**31, (1,))))
```
This will always print:
```
Random: 224899943
Numpy: 843828735
Torch: 843828736
```

Now that you've computed your vocabulary, you may want to make use of pre-trained word vectors to set your embeddings, like so:
```python
import torch
from torchnlp.encoders.text import WhitespaceEncoder
from torchnlp.word_to_vector import GloVe

encoder = WhitespaceEncoder(["now this ain't funny", "so don't you dare laugh"])

vocab_set = set(encoder.vocab)

pretrained_embedding = GloVe(name='6B', dim=100, is_include=lambda w: w in vocab_set)
embedding_weights = torch.Tensor(encoder.vocab_size, pretrained_embedding.dim)
for i, token in enumerate(encoder.vocab):
    embedding_weights[i] = pretrained_embedding[token]
```
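A common next step (not shown in the original README) is to copy these weights into an embedding layer. A minimal sketch, reusing `encoder` and `embedding_weights` from the block above together with PyTorch's `nn.Embedding.from_pretrained`:

```python
import torch

# `freeze=False` keeps the pre-trained vectors trainable during fine-tuning.
embedding = torch.nn.Embedding.from_pretrained(embedding_weights, freeze=False)

token_indices = encoder.encode("now this ain't funny")
print(embedding(token_indices).shape)  # (number of tokens, 100)
```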
For example, from the neural network package, apply the state-of-the-art `LockedDropout`:
```python
import torch
from torchnlp.nn import LockedDropout

input_ = torch.randn(6, 3, 10)
dropout = LockedDropout(0.5)

# Apply a LockedDropout to `input_`
dropout(input_)  # RETURNS: torch.FloatTensor (6x3x10)
```
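The "locked" part means one dropout mask is sampled and then reused at every step of the sequence (variational dropout), rather than resampled per timestep. A small sketch to visualize this, assuming the first dimension of `input_` is the sequence dimension:

```python
import torch
from torchnlp.nn import LockedDropout

input_ = torch.randn(6, 3, 10)
out = LockedDropout(0.5)(input_)

# With one reused mask, the zeroed feature positions should be identical
# at every position along the first (sequence) dimension.
zeroed = (out == 0)
print(bool((zeroed == zeroed[0]).all()))  # Expected: True
```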
Compute common NLP metrics such as the BLEU score.
```python
from torchnlp.metrics import get_moses_multi_bleu

hypotheses = ["The brown fox jumps over the dog 笑"]
references = ["The quick brown fox jumps over the lazy dog 笑"]

# Compute BLEU score with the official BLEU perl script
get_moses_multi_bleu(hypotheses, references, lowercase=True)  # RETURNS: 47.9
```
The longer examples in examples/ may also help you.
Need more help? We are happy to answer your questions via Gitter Chat.
We've released PyTorch-NLP because we found a lack of basic toolkits for NLP in PyTorch. We hope that other organizations can benefit from the project. We are thankful for any contributions from the community.

Read our contributing guide to learn about our development process, how to propose bugfixes and improvements, and how to build and test your changes to PyTorch-NLP.
torchtext and PyTorch-NLP differ in architecture and feature set; otherwise, they are similar. Both provide pre-trained word vectors, datasets, iterators and text encoders. PyTorch-NLP also provides neural network modules and metrics. From an architecture standpoint, torchtext is object oriented with external coupling, while PyTorch-NLP is object oriented with low coupling.
AllenNLP is designed to be a platform for research. PyTorch-NLP is designed to be a lightweight toolkit.
- Michael Petrochuk — Developer
- Chloe Yeo — Logo Design
If you find PyTorch-NLP useful for an academic publication, then please use the following BibTeX to cite it:
```
@misc{pytorch-nlp,
  author = {Petrochuk, Michael},
  title = {PyTorch-NLP: Rapid Prototyping with PyTorch Natural Language Processing (NLP) Tools},
  year = {2018},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/PetrochukM/PyTorch-NLP}},
}
```