yet-another-account/openwebtext
This project is a clone of the GPT-2 WebText dataset as outlined in the OpenAI paper. This project is still heavily WIP.
Huge thanks to jcpeterson for letting me use his download code. His version of OpenWebText is super well written, so please check it out!
Requires Pipenv and Python 3.
To install python dependencies:
pipenv install
Newspaper Dependencies:
On Ubuntu:
sudo apt-get install libxml2-dev libxslt-dev
On OS X:
brew install libxml2 libxslt
- Get the list of URLs from Reddit:
pipenv run python get_urls.py
- Download data from URLs:
pipenv run python download.py
Resulting files will be deposited in data/ with format {domain}-{sha256 hash of url}.txt.
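To make the naming scheme above concrete, here is a minimal sketch of how an output path could be derived from a URL. The helper name `output_path` is hypothetical, and two details are assumptions rather than something this README states: that the domain is the URL's network location, and that the hash is the hex digest of the UTF-8 encoded URL. Check download.py for the actual behavior.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse


def output_path(url: str, data_dir: str = "data") -> Path:
    """Build a path of the form data/{domain}-{sha256 hash of url}.txt.

    Hypothetical helper for illustration; not taken from download.py.
    """
    # Assumption: "domain" means the network location part of the URL.
    domain = urlparse(url).netloc
    # Assumption: the hash is the hex SHA-256 digest of the UTF-8 URL string.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(data_dir) / f"{domain}-{digest}.txt"


if __name__ == "__main__":
    print(output_path("https://example.com/some/article"))
```

Hashing the full URL keeps filenames unique when several pages come from the same domain, while the domain prefix keeps files from one site grouped together in a directory listing.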
Enjoy!