An open clone of the GPT-2 WebText dataset by OpenAI. Still WIP.


yet-another-account/openwebtext


This project is a clone of the GPT-2 WebText dataset, as outlined in the OpenAI paper. It is still very much a work in progress.

Huge thanks to jcpeterson for letting me use his download code. His version of OpenWebText is super well written, so please check it out!

Dependencies

Pipenv and Python 3.

To install the Python dependencies:

pipenv install

Newspaper Dependencies:

On Ubuntu:

sudo apt-get install libxml2-dev libxslt-dev

On OS X:

brew install libxml2 libxslt

Usage

  1. Get the list of URLs from Reddit:
     pipenv run python get_urls.py
  2. Download data from the URLs:
     pipenv run python download.py

Resulting files will be deposited in data/ with the format {domain}-{sha256 hash of url}.txt.
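To make the naming scheme concrete, here is a minimal sketch of how such a file name could be derived from a URL. It is not taken from download.py: the helper name output_filename is hypothetical, and it assumes that "domain" means the URL's network location and that the hash is the hex SHA-256 digest of the UTF-8 encoded URL. The actual script may normalize URLs or extract the domain differently.

```python
import hashlib
from urllib.parse import urlparse


def output_filename(url: str) -> str:
    """Build a '{domain}-{sha256 hash of url}.txt' style name for a URL.

    Hypothetical helper for illustration; not part of this repository.
    """
    # Assumption: "domain" is the network location, e.g. "example.com".
    domain = urlparse(url).netloc
    # Assumption: the hash is the hex digest of the UTF-8 encoded URL.
    url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return f"{domain}-{url_hash}.txt"


if __name__ == "__main__":
    # Prints the domain, a dash, 64 hex characters, and the .txt suffix.
    print(output_filename("https://example.com/some/article"))
```

Under these assumptions, hashing the full URL gives each page its own file even when many pages share a domain, while the domain prefix keeps files from the same site grouped together in a sorted directory listing.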

Enjoy!
