EleutherAI (/əˈluːθər/[2]) is a grass-roots non-profit artificial intelligence (AI) research group. The group, considered an open-source version of OpenAI,[3] was formed in a Discord server in 2020 to create an open-source version of GPT-3.[4] In early 2023, it formally incorporated as the EleutherAI Institute, a non-profit research institute.[5] As of 2025, the nonprofit maintains widely used training datasets, conducts research, and is involved in public policy, among other activities.[4]
EleutherAI began as a Discord server on July 7, 2020, under the tentative name "LibreAI" before rebranding to "EleutherAI" later that month,[6][better source needed] in reference to eleutheria, the Greek word for liberty.[3] Its founding members are Connor Leahy, Leo Gao, and Sid Black.[3] They co-wrote the code for EleutherAI to serve as a collection of open-source AI research, creating a machine learning model similar to GPT-3.[5]
On December 31, 2020, EleutherAI released The Pile, a curated dataset of diverse text for training large language models. While the accompanying paper referenced the existence of the GPT-Neo models, the models themselves were not released until March 21, 2021.[7][better source needed] On June 9, 2021, EleutherAI followed this up with GPT-J-6B, a six-billion-parameter language model that was again the largest open-source GPT-3-like model in the world.[8][better source needed] These language models were released under the Apache 2.0 free software license and are considered to have "fueled an entirely new wave of startups".[5]
While EleutherAI initially turned down funding offers, preferring to use Google's TPU Research Cloud program to source their compute,[3] by early 2021 they had accepted funding from CoreWeave (a small cloud computing company) and SpellML (a cloud infrastructure company) in the form of access to the powerful GPU clusters necessary for large-scale machine learning research. On February 10, 2022, they released GPT-NeoX-20B, a model similar to their prior work but scaled up thanks to the resources CoreWeave provided.[9]
In early 2023, EleutherAI incorporated as a non-profit research institute run by Stella Biderman, Curtis Huebner, and Shivanshu Purohit.[5] EleutherAI also announced a shift towards work in interpretability, alignment, and scientific research.[10][non-primary source needed] EleutherAI felt that "there is substantially more interest in training and releasing LLMs than there once was", enabling them to focus on other projects.[11]
In July 2024, an investigation by Proof News found that EleutherAI's The Pile dataset includes subtitles from over 170,000 YouTube videos across more than 48,000 channels. The findings drew criticism and accusations of theft from YouTubers and others whose work had been published on the platform.[12]
In 2025, EleutherAI released a new training dataset, the Common Pile, which omits the controversial copyrighted material contained in its earlier release, The Pile, and trained two models on it.[13] In collaboration with the UK's AI Security Institute, EleutherAI found that filtering training data to remove key concepts can maintain model performance while reducing the ability to provide harmful information.[14][15]
The Pile is an 886 GB dataset designed for training large language models. It was originally developed to train EleutherAI's GPT-Neo models[17] but has become widely used to train other models, including Microsoft's Megatron-Turing Natural Language Generation.[18] Compared to other datasets, the Pile's main distinguishing features are that it is a curated selection of data, chosen by EleutherAI researchers to contain information they thought language models should learn, and that it is the only such dataset thoroughly documented by the researchers who developed it.[19] The initial Pile dataset has come under scrutiny for containing copyrighted material,[13] including books[20][21][22] and subtitles from documentaries, movies, television, and online videos,[23] including from YouTube.[24][12]
Common Pile v0.1, released in June 2025 in partnership with a large number of collaborators, contains only works whose licenses permit their use for training AI models.[13][25]
EleutherAI's most prominent research relates to its work training open-source large language models inspired by OpenAI's GPT-3.[7][better source needed] EleutherAI's "GPT-Neo" model series comprises models of 125 million, 1.3 billion, 2.7 billion, 6 billion, and 20 billion parameters.
GPT-Neo (125M, 1.3B, 2.7B):[26] released in March 2021, it was the largest open-source GPT-3-style language model in the world at the time of release.
GPT-J (6B): released in June 2021, it was the largest open-source GPT-3-style language model in the world at the time of release.
Artificial intelligence art created with VQGAN-CLIP, a text-to-image model created by EleutherAI

Artificial intelligence art created with CLIP-Guided Diffusion, another text-to-image model created by Katherine Crowson of EleutherAI[27][28]
Following the release of DALL-E by OpenAI in January 2021, EleutherAI started working on text-to-image synthesis models. When OpenAI did not release DALL-E publicly, EleutherAI's Katherine Crowson and digital artist Ryan Murdock developed a technique for using CLIP (another model developed by OpenAI) to convert regular image generation models into text-to-image synthesis ones.[29][30][31][32] Building on ideas dating back to Google's DeepDream,[33] they found their first major success combining CLIP with a publicly available model called VQGAN; the resulting model is called VQGAN-CLIP.[34] Crowson released the technology by tweeting notebooks demonstrating the technique, which people could run for free without any special equipment.[35][36][37]
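At its core, the CLIP-guidance technique described above is gradient-based optimization: a generator's latent code is nudged, step by step, so that the CLIP similarity between the generated image and a text prompt increases. The following is a minimal toy sketch of that loop in plain NumPy; the random linear maps `G` and `E` are placeholders standing in for VQGAN's decoder and CLIP's image encoder (assumptions for illustration, not the real networks).

```python
import numpy as np

rng = np.random.default_rng(0)
D_LATENT, D_IMG, D_EMB = 16, 64, 32

# Placeholders for the real models (assumptions, not the actual networks):
# G acts as the "VQGAN decoder" (latent -> image),
# E acts as the "CLIP image encoder" (image -> embedding).
G = rng.normal(size=(D_IMG, D_LATENT))
E = rng.normal(size=(D_EMB, D_IMG))

def normalize(v):
    return v / np.linalg.norm(v)

def clip_score(z, text_emb):
    """Cosine similarity between the embedding of the 'image' G @ z and a unit text embedding."""
    img_emb = E @ (G @ z)
    return float(normalize(img_emb) @ text_emb)

def guide(z, text_emb, lr=1e-2, steps=200):
    """Gradient ascent on the latent z to maximize CLIP similarity (the VQGAN-CLIP idea)."""
    for _ in range(steps):
        u = E @ (G @ z)                  # unnormalized image embedding
        n = np.linalg.norm(u)
        # Analytic gradient of cos-similarity (u . t) / ||u|| with respect to u:
        grad_u = text_emb / n - (u @ text_emb) * u / n**3
        # Chain rule back through the linear "encoder" and "decoder":
        z = z + lr * (G.T @ (E.T @ grad_u))
    return z

# Pretend text_emb encodes a prompt such as "a painting of a lighthouse".
text_emb = normalize(rng.normal(size=D_EMB))
z0 = rng.normal(size=D_LATENT)
z1 = guide(z0, text_emb)
print(clip_score(z0, text_emb), "->", clip_score(z1, text_emb))
```

In the real pipeline the gradient flows by automatic differentiation through CLIP and VQGAN rather than an analytic formula, and the "image" is an actual pixel array, but the control flow is the same: score, backpropagate, update the latent, repeat.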
^ Khan, Mehtab; Hanna, Alex (2023). "The Subjects and Stages of AI Dataset Development: A Framework for Dataset Accountability". Ohio State Technology Law Journal. 19 (2): 171–256. hdl:1811/103549. SSRN 4217148.