EleutherAI (/əˈluːθər/[2]) is a grass-roots non-profit artificial intelligence (AI) research group. The group, considered an open-source version of OpenAI,[3] was formed in a Discord server in July 2020 by Connor Leahy, Sid Black, and Leo Gao[4] to organize a replication of GPT-3. In early 2023, it formally incorporated as the EleutherAI Institute, a non-profit research institute.[5]
EleutherAI began as a Discord server on July 7, 2020, under the tentative name "LibreAI" before rebranding to "EleutherAI" later that month,[6] in reference to eleutheria, the Greek word for liberty.[3] Its founding members are Connor Leahy, Leo Gao, and Sid Black. They co-wrote the code for Eleuther to serve as a collection of open-source AI research, creating a machine learning model similar to GPT-3.[7]
On December 30, 2020, EleutherAI released The Pile, a curated dataset of diverse text for training large language models.[8] While the paper referenced the existence of the GPT-Neo models, the models themselves were not released until March 21, 2021.[9] According to a retrospective written several months later, the authors did not anticipate that "people would care so much about our 'small models.'"[1] On June 9, 2021, EleutherAI followed this up with GPT-J-6B, a six-billion-parameter language model that was again the largest open-source GPT-3-like model in the world.[10] These language models were released under the Apache 2.0 free software license and are considered to have "fueled an entirely new wave of startups".[5]
While EleutherAI initially turned down funding offers, preferring to use Google's TPU Research Cloud Program to source their compute,[11] by early 2021 they had accepted funding from CoreWeave (a small cloud computing company) and SpellML (a cloud infrastructure company) in the form of access to the powerful GPU clusters that are necessary for large-scale machine learning research. On February 10, 2022, they released GPT-NeoX-20B, a model similar to their prior work but scaled up thanks to the resources CoreWeave provided.[12]
In 2022, many EleutherAI members participated in the BigScience Research Workshop, working on projects including multitask finetuning,[13][14] training BLOOM,[15] and designing evaluation libraries.[15] Engineers at EleutherAI, Stability AI, and NVIDIA joined forces with biologists led by Columbia University and Harvard University[16] to train OpenFold, an open-source replication of DeepMind's AlphaFold2.[17]
In early 2023, EleutherAI incorporated as a non-profit research institute run by Stella Biderman, Curtis Huebner, and Shivanshu Purohit.[5][18] This announcement came with the statement that EleutherAI's shift of focus away from training larger language models was part of a deliberate push towards doing work in interpretability, alignment, and scientific research.[18] While EleutherAI is still committed to promoting access to AI technologies, they feel that "there is substantially more interest in training and releasing LLMs than there once was," enabling them to focus on other projects.[19]
In July 2024, an investigation by Proof News found that EleutherAI's The Pile dataset includes subtitles from over 170,000 YouTube videos across more than 48,000 channels. The findings drew criticism and accusations of theft from YouTubers and others who had their work published on the platform.[20][21]
In 2025, Stella Biderman served as executive director, Aviya Skowron as head of policy and ethics, Nora Belrose as head of interpretability, and Quentin Anthony as head of HPC.[22]
According to their website, EleutherAI is a "decentralized grassroots collective of volunteer researchers, engineers, and developers focused on AI alignment, scaling, and open-source AI research".[23] While they do not sell any of their technologies as products, they publish the results of their research in academic venues, write blog posts detailing their ideas and methodologies, and provide trained models for anyone to use for free.[citation needed]
The Pile is an 886 GB dataset designed for training large language models. It was originally developed to train EleutherAI's GPT-Neo models but has become widely used to train other models, including Microsoft's Megatron-Turing Natural Language Generation,[24][25] Meta AI's Open Pre-trained Transformers,[26] LLaMA,[27] and Galactica,[28] Stanford University's BioMedLM 2.7B,[29] the Beijing Academy of Artificial Intelligence's Chinese-Transformer-XL,[30] and Yandex's YaLM 100B.[31] Compared to other datasets, the Pile's main distinguishing features are that it is a curated selection of data chosen by researchers at EleutherAI to contain information they thought language models should learn and that it is the only such dataset that is thoroughly documented by the researchers who developed it.[32]
EleutherAI's most prominent research relates to its work to train open-source large language models inspired by OpenAI's GPT-3.[33] EleutherAI's "GPT-Neo" model series includes models with 125 million, 1.3 billion, 2.7 billion, 6 billion, and 20 billion parameters.
GPT-Neo (125M, 1.3B, 2.7B):[34] released in March 2021, it was the largest open-source GPT-3-style language model in the world at the time of release.
GPT-J (6B):[35] released in June 2021, it was the largest open-source GPT-3-style language model in the world at the time of release.[36]
GPT-NeoX (20B):[37] released in February 2022, it was the largest open-source language model in the world at the time of release.
Pythia (12B):[38] While prior models focused on scaling up to close the gap with closed-source models like GPT-3, the Pythia model suite goes in another direction. The Pythia suite was designed to facilitate scientific research on the capabilities of and learning processes in large language models.[38] Featuring 154 partially trained model checkpoints, fully public training data, and the ability to reproduce the exact training order, Pythia enables research on verifiable training,[39] social biases,[38] memorization,[40] and more.[41]
Artificial intelligence art created with VQGAN-CLIP, a text-to-image model created by EleutherAI
Artificial intelligence art created with CLIP-Guided Diffusion, another text-to-image model created by Katherine Crowson of EleutherAI[42][43]
Following the release of DALL-E by OpenAI in January 2021, EleutherAI started working on text-to-image synthesis models. When OpenAI did not release DALL-E publicly, EleutherAI's Katherine Crowson and digital artist Ryan Murdock developed a technique for using CLIP (another model developed by OpenAI) to convert regular image generation models into text-to-image synthesis ones.[44][45][46][47] Building on ideas dating back to Google's DeepDream,[48] they found their first major success by combining CLIP with another publicly available model called VQGAN; the resulting model is called VQGAN-CLIP.[49] Crowson released the technology by tweeting notebooks demonstrating the technique that people could run for free without any special equipment.[50][51][52] This work was credited by Stability AI CEO Emad Mostaque as motivating the founding of Stability AI.[53]
EleutherAI's work to democratize GPT-3 won the UNESCO Netexplo Global Innovation Award in 2021[54] and InfoWorld's Best of Open Source Software Award in 2021[55] and 2022,[56] and was nominated for VentureBeat's AI Innovation Award in 2021.[57]
Gary Marcus, a cognitive scientist and noted critic of deep learning companies such as OpenAI and DeepMind,[58] has repeatedly[59][60] praised EleutherAI's dedication to open-source and transparent research.
Maximilian Gahntz, a senior policy researcher at the Mozilla Foundation, applauded EleutherAI's efforts to give more researchers the ability to audit and assess AI technology. "If models are open and if data sets are open, that'll enable much more of the critical research that's pointed out many of the flaws and harms associated with generative AI and that's often far too difficult to conduct."[61]
Technology journalist Kyle Wiggers has raised concerns about whether EleutherAI is as independent as it claims, or "whether the involvement of commercially motivated ventures like Stability AI and Hugging Face, both of which are backed by substantial venture capital, might influence EleutherAI's research."[62]
^ Gao, Leo; Biderman, Stella; Black, Sid; et al. (31 December 2020). "The Pile: An 800GB Dataset of Diverse Text for Language Modeling". arXiv:2101.00027.
^ Black, Sid; Biderman, Stella; Hallahan, Eric; et al. (14 April 2022). "GPT-NeoX-20B: An Open-Source Autoregressive Language Model". arXiv:2204.06745 [cs.CL].
^ Sanh, Victor; et al. (2021). "Multitask Prompted Training Enables Zero-Shot Task Generalization". arXiv:2110.08207 [cs.LG].
^ Muennighoff, Niklas; Wang, Thomas; Sutawika, Lintang; Roberts, Adam; Biderman, Stella; Le Scao, Teven; Bari, M Saiful; Shen, Sheng; Yong, Zheng-Xin; Schoelkopf, Hailey; Tang, Xiangru; Radev, Dragomir; Aji, Alham Fikri; Almubarak, Khalid; Albanie, Samuel; Alyafeai, Zaid; Webson, Albert; Raff, Edward; Raffel, Colin (2022). "Crosslingual Generalization through Multitask Finetuning". arXiv:2211.01786 [cs.CL].
^ a b BigScience Workshop; et al. (2022). "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model". arXiv:2211.05100 [cs.CL].
^ Zhang, Susan; Roller, Stephen; Goyal, Naman; Artetxe, Mikel; Chen, Moya; Chen, Shuohui; Dewan, Christopher; Diab, Mona; Li, Xian; Lin, Xi Victoria; Mihaylov, Todor; Ott, Myle; Shleifer, Sam; Shuster, Kurt; Simig, Daniel; Koura, Punit Singh; Sridhar, Anjali; Wang, Tianlu; Zettlemoyer, Luke (21 June 2022). "OPT: Open Pre-trained Transformer Language Models". arXiv:2205.01068 [cs.CL].
^ Touvron, Hugo; Lavril, Thibaut; Izacard, Gautier; Grave, Edouard; Lample, Guillaume; et al. (27 February 2023). "LLaMA: Open and Efficient Foundation Language Models". arXiv:2302.13971 [cs.CL].
^ Taylor, Ross; Kardas, Marcin; Cucurull, Guillem; Scialom, Thomas; Hartshorn, Anthony; Saravia, Elvis; Poulton, Andrew; Kerkez, Viktor; Stojnic, Robert (16 November 2022). "Galactica: A Large Language Model for Science". arXiv:2211.09085 [cs.CL].
^ Khan, Mehtab; Hanna, Alex (2023). "The Subjects and Stages of AI Dataset Development: A Framework for Dataset Accountability". Ohio State Technology Law Journal. 19 (2): 171–256. hdl:1811/103549. SSRN 4217148.
^ a b c Biderman, Stella; Schoelkopf, Hailey; Anthony, Quentin; Bradley, Herbie; O'Brien, Kyle; Hallahan, Eric; Khan, Mohammad Aflah; Purohit, Shivanshu; Prashanth, USVSN Sai; Raff, Edward; Skowron, Aviya; Sutawika, Lintang; van der Wal, Oskar (2023). "Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling". arXiv:2304.01373 [cs.CL].
^ Choi, Dami; Shavit, Yonadav; Duvenaud, David (2023). "Tools for Verifying Neural Models' Training Data". arXiv:2307.00682 [cs.LG].
^ Biderman, Stella; Prashanth, USVSN Sai; Sutawika, Lintang; Schoelkopf, Hailey; Anthony, Quentin; Purohit, Shivanshu; Raff, Edward (2023). "Emergent and Predictable Memorization in Large Language Models". arXiv:2304.11158 [cs.CL].
^ Gupta, Kshitij; Thérien, Benjamin; Ibrahim, Adam; Richter, Mats L.; Anthony, Quentin; Belilovsky, Eugene; Rish, Irina; Lesort, Timothée (2023). "Continual Pre-Training of Large Language Models: How to (Re)warm your model?". arXiv:2308.04014 [cs.CL].
^ Borck, James R.; Heller, Martin; Oliver, Andrew C.; Pointer, Ian; Tyson, Matthew; Yegulalp, Serdar (18 October 2021). "The best open source software of 2021". InfoWorld. Archived from the original on 8 March 2023. Retrieved 8 March 2023.
^ Borck, James R.; Heller, Martin; Oliver, Andrew C.; Pointer, Ian; Sacolick, Isaac; Tyson, Matthew; Yegulalp, Serdar (17 October 2022). "The best open source software of 2022". InfoWorld. Archived from the original on 8 March 2023. Retrieved 8 March 2023.