Epoch’s AI Models Dataset is a collection of machine learning models useful for research about trends in the history and future of artificial intelligence. It includes over two thousand machine learning models, encompassing a broad range of domains and scales.
This documentation describes which models are contained within the database, the structure of its records (including data fields and definitions), and the processes for adding new entries and auditing accuracy. It also includes a changelog and acknowledgements.
The data contain several subsets, which can be viewed using the interactive visualization:

- Notable AI models are those that meet our notability criteria.
- Frontier models are models that were in the top 10 by training compute as of the time of their release.
- Large-scale models are models trained with at least 10^23 floating-point operations.
- All models shows every model in the dataset, including models that do not qualify for the above categories, such as models found during our research on algorithmic progress, AI for biology, or other investigations.
The dataset is available on our website as a visualization or table, and is available for download in CSV format, updated daily. For a quick-start example of loading the data and working with it in your research, see this Google Colab demo notebook.
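As a minimal sketch of loading and filtering the data with pandas (the local filename is an assumption, and the column names follow the field guide below — check them against the downloaded file):

```python
import pandas as pd

# To work with the daily-updated CSV, download it from the table view at
# https://epoch.ai/data/ai-models and load it, e.g.:
#   df = pd.read_csv("ai_models.csv")  # filename is an assumption
# For illustration, a tiny synthetic frame with column names matching the
# field guide below (values are examples, not the full dataset):
df = pd.DataFrame({
    "Model": ["Llama 2-70B", "Example small model"],
    "Publication date": ["2023-07-18", "2012-07-01"],
    "Training compute (FLOP)": [8.1e23, 4.7e17],
})
df["Publication date"] = pd.to_datetime(df["Publication date"])

# Large-scale models: trained with at least 1e23 FLOP.
large_scale = df[df["Training compute (FLOP)"] >= 1e23]
print(large_scale["Model"].tolist())  # ['Llama 2-70B']
```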
If you would like to ask any questions about the database, or suggest a model that should be added, contact us at data@epoch.ai.
If this dataset is useful for you, please cite it.
Epoch AI, ‘Parameter, Compute and Data Trends in Machine Learning’. Published online at epoch.ai. Retrieved from: ‘https://epoch.ai/data/ai-models’ [online resource]

```
@misc{epoch2022pcdtrends,
  title = "Parameter, Compute and Data Trends in Machine Learning",
  author = {{Epoch AI}},
  year = 2022,
  url = {https://epoch.ai/data/ai-models},
  note = "Accessed:"
}
```

The database covers AI models, especially models that are notable for advancing the state of the art, for having a large impact on the world, or for their significance in the history of the field. Here, we give an overview of how the data have been collected, and define the criteria for inclusion and notability.
To be included in the database, an ML model must satisfy all inclusion criteria:
Once added to the database, models are marked as notable if they satisfy any of the following:
Where there are many related models, for example several checkpoints along training or several sizes of a given model family, the database preferentially includes the version that used the most compute. Other versions may be included where they are notable in their own right.
Identifying whether a model is state-of-the-art can be a more involved process, compared to simply checking citations or the training compute budget. We consider a model to be state of the art if there is good reason to believe that it was the best existing model at the time for a task of genuine interest. The default way to provide evidence for this is state-of-the-art performance on a recognised benchmark.
To be recognised, a benchmark should have any of the following:
At our discretion, we may also identify models as state of the art where no benchmark result exists, but there is convincing evidence that a model truly is state-of-the-art. Eligible sources of evidence here are comparison on a non-benchmark database, a high-quality user preference study, or demonstration of state of the art capabilities. For example, GraphCast is compared against other weather prediction models on a weather database that is not a standalone benchmark. Nevertheless, we take this as convincing evidence that it is state of the art.
Models can be included on the grounds of historical significance if they marked a significant advance in AI history, even if they did not strictly advance the state of the art on any application. For example, many neural network breakthroughs performed worse than other ML techniques, but were directly influential for later AI development. Evidence to support this status may come from citations in later notable models, discussion in reviews or textbooks, or other unambiguous identification as an influential result.
Models can be included at the discretion of Epoch staff if they are as notable as the other models identified but not covered by the categories above. For example, we may mark a model as notable if it is on the Pareto frontier of cost-efficiency for an important task despite not having the highest performance on a benchmark.
| Example | Include? | Why |
|---|---|---|
| Human-level control through deep reinforcement learning | Yes | Well-documented learned model, over 5000 citations, advanced state of the art for autonomous gameplay. |
| Stochastic Neural Analog Reinforcement Calculator | Yes | No individual associated paper, but other sources confirm its existence, and it was indisputably historically significant as one of the first neural learning systems. |
| Theory of neural-analog reinforcement systems and its application to the brain model problem | No | Historically significant, but no experimentally trained model; the result is entirely theoretical. |
| Scaling scaling laws with board games | No | Doesn’t meet any notability criteria. In addition to not being highly cited and using small compute models, there is no attempt at state of the art results. Rather, this is a paper examining scaling details. |
This dataset has been collected from a variety of sources: literature reviews, historical accounts of AI development, highly-cited publications from top conferences, high-profile models from leading industry labs, bibliographies of notable papers, pre-existing datasets curating AI papers (see Acknowledgements), and ad hoc suggestions from contributors.
We monitor news coverage, releases from key AI labs, and benchmarks to identify new models as they are released. This can lead to a lag for new models. Typically, we aim to add the most prominent releases (e.g. GPT-4) within days of release. For less prominent models, reporting lags may extend to months.
As of November 28, 2025, the dataset contains 3204 models, of which 1368 have compute estimates.
The database focuses on information relevant to trends in AI model development. Records in the database have information about three broad areas:
Bibliographic information about the model and its associated publication, for example its title, URL, authors, citations, date, etc.
Training details such as training compute, parameters, dataset size, hardware used for training, etc.
Metadata about the record, such as notes on the above fields with supporting evidence and context, our confidence in the key estimates, etc.
We provide a comprehensive guide to the database’s fields below. This includes examples taken from Llama-2 70B, one of the best-documented recent models. If you would like to ask any questions about the database, or request a field that should be added, feel free to contact us at data@epoch.ai.
| Column | Type | Definition | Example from Llama 2-70B | Coverage |
|---|---|---|---|---|
| Abstract | Text | Abstract text from the publication associated with the model. | In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs. | 88% 2809 out of 3204 models |
| Authors | Text | Comma-separated list of authors. | Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom | 77% 2454 out of 3204 models |
| Base model | Categorical (single select) | Which base model the model was fine-tuned from, if applicable. | [empty] This is empty because Llama-2 was not finetuned from a base model. For a non-empty example, consider CodeLlama, a Llama-2 finetune. The base model would be Llama-2. | 21% 662 out of 3204 models |
| Batch size | Numeric | Batch size used during training. | 4000000 | 7% 229 out of 3204 models |
| Citations | Numeric | Number of citations as of last update. Values are collected from Semantic Scholar where available, otherwise manually from Google Scholar. | 13977 | 40% 1268 out of 3204 models |
| Confidence | Categorical (single select) | Metadata describing our confidence in the recorded values for Training compute, Parameters, and Training dataset size. This describes confidence for the most uncertain of these values, where they have a non-empty entry (compute is typically the most uncertain). | Confident | 100% 3204 out of 3204 models |
| Country (of organization) | Categorical (multiple select) | Country/countries associated with the developing organization(s). Multinational is used to mark organizations associated with multiple countries. | United States of America | 97% 3117 out of 3204 models |
| Domain | Categorical (multiple select) | The machine learning domain(s) of application associated with the model. This is fairly high-level, for example “Language” incorporates many different ML tasks. Possible values: 3D modeling, Astronomy, Audio, Biology, Cybersecurity, Driving, Earth science, Games, Image generation, Language, Materials science, Mathematics, Medicine, Multimodal, Other, Psychology, Recommendation, Robotics, Search, Speech, Video, Vision | Language | 100% 3204 out of 3204 models |
| Task | Categorical (multiple select) | The fine-grained task(s) that the model is designed to perform. These are specific applications of the model to different problems, and can span multiple domains. Task labels are assigned by following a flowchart: each applicable branch of the flowchart is followed until a leaf node is reached. If the task is already in the database, the model is tagged with that task; if the task does not yet exist in the database, the model is tagged with the new task and the task is added to the flowchart. | Language modeling, Language modeling/generation, Question answering | 96% 3086 out of 3204 models |
| Epochs | Numeric | The number of epochs (repetitions of the training dataset) used to train the model. | 1 | 24% 782 out of 3204 models |
| Finetune compute (FLOP) | Numeric | Compute used to fine-tune the model, if applicable. | [empty] | 8% 258 out of 3204 models |
| Hardware quantity | Numeric | Indicates the quantity of the hardware used in training, i.e. the number of chips. | 1000 | 26% 831 out of 3204 models |
| Hardware utilization (MFU) | Numeric | Number between 0.00 and 1.00 indicating the hardware utilization ratio, i.e. utilized FLOPs / theoretical maximum FLOPs. This value reflects utilization based on computations successfully applied to model training, and does not include computations performed by the hardware which do not ultimately affect the model. | 0.4191975017 | 2% 58 out of 3204 models |
| Hardware utilization (HFU) | Numeric | Number between 0.00 and 1.00 indicating the hardware utilization ratio, i.e. utilized FLOPs / theoretical maximum FLOPs. This value reflects utilization based on measured computational throughput in the hardware during training. (Model FLOPs utilization is a better measure of utilization if it is available.) | [empty] | 1% 24 out of 3204 models |
| Link | URL | Link(s) to best-choice sources documenting a model. This should preferentially be a journal or conference paper, preprint, or technical report. If these are not available, the links should point to other supporting evidence, such as an announcement post, a news article, or similar. | https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/ https://arxiv.org/abs/2307.09288 | 99% 3170 out of 3204 models |
| Model | Text | The name of the model. This should be unique within the database, and should be the best-known name for a given model. This column must be filled in, and is used as the primary key for indexing entries in the dataset. | Llama 2-70B | 100% 3204 out of 3204 models |
| Notability criteria | Categorical (multiple select) | The criteria met by the model which qualify it for notability. To be notable, a model must meet at least one criterion. Possible values are highly cited, large training cost, significant use, state of the art, or historical significance. These are discussed further in Inclusion. | Historical significance, Significant use, Highly cited, Training cost | 28% 905 out of 3204 models |
| Organization | Categorical (multiple select) | Organization(s) who created the model. Organizations may have multiple different names, but we aim to standardize organization names where they refer to the same organization. Therefore, organizations are periodically reviewed in Airtable and standardized to the most common name for them. For example, “University of California, Berkeley” and “Berkeley” have been changed to “UC Berkeley”. Note that some organizations have similar names but genuinely are different organizations, for example Google Brain versus Google versus Google DeepMind. | Meta AI | 97% 3123 out of 3204 models |
| Organization categorization | Categorical (multiple select) | Categorization of the organization(s), automatically populated from the Organization entry. Models are categorized as “Industry” if their authors are affiliated with private sector organizations, “Academia” if the authors are affiliated with universities or academic institutions, or “Industry - Academia Collaboration” when at least 30% of the authors are from each. Possible values: Industry, Research Collective, Academia, Industry - Academia Collaboration (Industry leaning), Industry - Academia Collaboration (Academia leaning), Non-profit | Industry | 97% 3105 out of 3204 models |
| Parameters | Numeric | Number of learnable parameters in the model. For neural networks, these are the weights and biases. Further information is provided in Estimation. | 7.0e10 | 65% 2070 out of 3204 models |
| Publication date | Date | The publication, announcement, or release date of the model, in YYYY-MM-DD format. If the year and month are known but the day is unknown, the day is filled in as YYYY-MM-15. If the year is known but the month and day are unknown, the month and day are filled in as YYYY-07-01. | 2023-07-18 | 99% 3186 out of 3204 models |
| Reference | Text | The literature reference for the model, such as the title of the journal or conference paper, academic preprint, or technical report. | Llama 2: Open Foundation and Fine-Tuned Chat Models | 95% 3047 out of 3204 models |
| Training compute (FLOP) | Numeric | Quantity of compute used to train the model, in FLOP. This is the total training compute for a given model, i.e. pretrain + finetune. It should be filled in here when directly reported, or calculated via GPU-hours or backpropagation gradient updates. Further guidance is provided in Estimation. | 8.1e23 | 43% 1368 out of 3204 models |
| Training compute cost (2023 USD) | Numeric | The training compute cost, estimated using the “amortized hardware capex plus energy” approach documented in ourtraining cost methodology. Values are converted to 2023 US dollars. | 1,102,561 | 7% 221 out of 3204 models |
| Training compute estimation method | Categorical (multiple select) | Indicates how the quantity of training compute was found or estimated, e.g. directly reported by the developer, estimated from hardware details and usage, estimated by counting operations, or estimated from benchmark performance. | Hardware, Operation counting | 45% 1428 out of 3204 models |
| Training hardware | Categorical (multiple select) | Type of training hardware used. Entries are cross-referenced against Epoch AI’s database of ML training hardware. | NVIDIA A100 SXM4 80 GB | 36% 1157 out of 3204 models |
| Training time (hours) | Numeric | Training time of the model, if reported. This refers to the time elapsed during the training process, not the number of chip-hours. For example, if a model were trained with 10 GPUs for 1 hour, the training time would be 1 hour. Includes the duration of all training phases conducted to develop the model, such as pre-training, post-training, RL, SFT, etc. If the model is fine-tuned from a previously published model, then that base model’s training time is not included. | 1728 | 17% 541 out of 3204 models |
| Training power draw (W) | Numeric | Power draw of the hardware used to train the model, in watts. Calculated as hardware quantity times processor TDP times datacenter PUE times server overhead. More details are provided in Estimating power draw. | 795557 | 24% 757 out of 3204 models |
| Frontier model | Boolean | Indicates whether a model was within the frontier, defined as models that were in the top 10 by training compute as of their release date. | False | 4% 131 out of 3204 models |
| Possibly over 1e23 FLOP | Boolean | Indicates whether a model was (or may have been) trained with at least 10^23 floating-point operations, which qualifies it for inclusion in the large-scale models dataset. | True | 15% 487 out of 3204 models |
| Model accessibility | Categorical (multiple select) | The accessibility of the model in terms of whether the model weights can be downloaded or, if the model weights are not accessible, whether the model can be used in an API or product. | Open weights (restricted use) | 77% 2457 out of 3204 models |
| Training code accessibility | Categorical (single select) | Denotes how the model can be accessed and used by the public. “Open weights (unrestricted)”, “Open weights (restricted use)” and “Open weights (non-commercial)” all mean that the model weights are downloadable by the public, but with different restrictions on use. “API access” means the model can only be interacted with via an application programming interface, and possibly also a hosted service. “Hosted access (no API)” means the model can only be interacted with via a hosted service. “Unreleased” means there is no way for the public to access the model. | Unreleased | 71% 2264 out of 3204 models |
| Notes fields, e.g. “Training compute notes” | Text | Metadata documenting the reasoning and/or evidence for a given column, e.g. training compute or dataset size. This is particularly important to note in cases where such information isn’t obvious. This field is unstructured text. | Training compute notes: "Pretraining utilized a cumulative 3.3M GPU hours of computation on hardware of type A100-80GB", of which 1720320 GPU hours were used to train the 70B model. 311.84 BF16 TFLOP/s * 1720320 hours * 0.40 utilization = 7.725e+23 FLOP. Alternatively: the model was trained for 1 epoch on 2 trillion tokens and has 70B parameters. C = 6ND = 6 * 70B * 2T = 8.4e+23 FLOP. | 50% 1593 out of 3204 models |
This section provides more information about recurring processes in the database: adding new models, updating citation counts, and updating the hosted files by which the dataset can be accessed for analysis.
Entries are added to the dataset near-daily, including both newly-released models and older models newly identified as notable. Typically, most information that can easily be determined from public information is added at the time a model is entered in the database. However, it is common for some information to gradually be entered later. For example, a compute estimate might be omitted at first and only added after we devote further effort to calculating it.
When models are added to the database, citation counts are recorded for those with academic publications or preprints. At the beginning of each month, citation counts are automatically updated for publications listed in Semantic Scholar. Publications not listed in Semantic Scholar rely on manual entry of citation count.
Epoch AI’s database is hosted as a CSV that is synced with the database daily. The easiest way to load the data in scripts is using the CSV URL. If you need the most up-to-date version reflecting unsynced changes, a CSV can be manually generated from the table view on the website.
Some fields within the database require estimation, because they are often not straightforwardly reported within papers or other sources. Here, we detail how estimation works for compute, model size, dataset size, and the metadata on estimate confidence.
Training compute is one of the most important pieces of information in our dataset, as reflected in its usage across Epoch AI’s research and elsewhere. However, estimating compute can be challenging. Here we outline how compute estimation is performed in the notable models dataset.
Compute is measured in units of floating-point operations (FLOP). For older models, sometimes the relevant operations were integer operations; in this case we report these instead. We do not apply any multiplier to adjust for operations potentially being more valuable under different tensor formats or precisions, for example FP16 versus FP32 or BF16. Some sources report compute in multiplication-and-addition operations, fused multiply-adds (FMAs), or similar. We treat one multiply-add/FMA as being equivalent to two FLOP to match typical reporting of chip performance.
For a given model in the database, training compute is provided as the total training compute, including pretraining, and including pretrained base models used as components. Finetuning compute is recorded in its associated column. Finetuning is distinguished by authors’ descriptions of the training as finetuning, or unambiguous use of a pretrained model in a distinct phase of training.
In the simplest case, training compute is directly reported in a paper, and we enter this figure into the database. When compute is not reported, we use two main methods to estimate it:
When there is enough information to count the operations, this is preferred in our dataset, because typically hardware-based estimates require assumptions about utilization, which may reduce the estimates’ accuracy.
Estimating compute from hardware details and usage is relatively straightforward, when the necessary details are known:
Once these details are known, the corresponding peak FLOP/s performance by hardware and number representation can be found from hardware documentation, or from the tool below. Finally, utilization rates account for real training runs falling significantly short of peak performance due to memory bottlenecks, network latency, etc. Typical utilization rates for large distributed training runs are around 30-50%. When these are not reported, they are estimated by reference to comparable models from a similar time period.
| ImageGPT |
|---|
Some training details are provided in the blog post: “[…] iGPT-L was trained for roughly 2500 V100-days […]” The number representation is not specified, but given this was trained by a major corporation in 2020, we assume the number format was FP16. The V100 has 125 TFLOP/s tensor FP16 performance. Assuming a utilization of 0.3, this leads to the following compute estimate: 8.1e21 FLOP = 2500 V100-days × 125e12 FLOP/s × 0.3 utilization × 86.4e3 s/day |
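The hardware-based method can be sketched as a one-line calculation; the figures below reproduce the ImageGPT estimate (the 125 TFLOP/s peak and 0.3 utilization are the assumptions stated in the example):

```python
def hardware_compute_estimate(chip_days: float, peak_flop_per_s: float,
                              utilization: float) -> float:
    """Training compute in FLOP: chip-seconds × peak FLOP/s × utilization."""
    return chip_days * 86_400 * peak_flop_per_s * utilization

# ImageGPT: 2500 V100-days at 125 TFLOP/s tensor FP16, assumed 0.3 utilization.
flop = hardware_compute_estimate(2500, 125e12, 0.3)
print(f"{flop:.1e}")  # 8.1e+21
```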
Counting the number of operations is often useful for older research, where hardware and usage details might be unavailable. A widely-applicable heuristic for the training compute of dense models is: Training compute ≈ (2 × # of connections) × 3 × (# of training examples) × (# of epochs). This works by first estimating the required FLOP for a forward pass, which is approximately twice the number of connections. This can be modified for sparsity such as Mixture-of-Experts: in this case, the heuristic should count only the connections in the active experts.
The forward pass FLOP is then multiplied by three to account for the backward pass, as the ratio between forward-pass and backward-pass FLOP is 1:2 for non-recurrent dense models. Finally, this is multiplied by the number of passes performed over the data: the number of training examples multiplied by the number of epochs the model was trained for. For transformer-based language models, this formula is equivalent to the commonly-used heuristic: Compute = 6 × # of parameters × # of training examples × # of epochs.
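As a sketch, the 6ND heuristic applied to Llama 2-70B’s reported figures (70B parameters, 2 trillion training tokens, 1 epoch) reproduces the operation-counting estimate quoted in the field guide above:

```python
def dense_training_compute(parameters: float, training_examples: float,
                           epochs: float = 1) -> float:
    """Compute = 6 × # of parameters × # of training examples × # of epochs."""
    return 6 * parameters * training_examples * epochs

# Llama 2-70B: 70e9 parameters, 2e12 tokens, 1 epoch.
print(f"{dense_training_compute(70e9, 2e12):.1e}")  # 8.4e+23
```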
Sometimes, the FLOP for a forward pass is reported directly in a paper. In this case, this value can be used directly instead of 2 × # of connections. Otherwise, the FLOP for a forward pass are evaluated by summing FLOP over the network’s layers. These are set out in Table 3.
| Layer | Forward pass FLOP per token (approx) |
|---|---|
| Fully connected layer from N neurons to M neurons | 2×N×M |
| CNN from a tensor of shape H×W×C with D filters of shape K×K×C, applied with stride S and padding P | 2×H×W×K^2×C×D/S^2 |
| Transpose CNN from a tensor of shape H×W×C with D filters of shape K×K×C, applied with stride S and padding P | 2×H×W×C×D×K^2 |
| RNN with bias vectors taking an input of size N and producing an output of size M | 2×(N+M)×M |
| Fully gated GRU with bias vectors taking an input of size N and producing an output of size M | 6×(N+M)×M |
| LSTM with bias vectors taking an input of size N and producing an output of size M | 8×(N+M)×M |
| Word Embedding for vocabulary size V and embedding dimension W | 0 |
| Self attention layer with sequence length L, inputs of size W, key of size D and output of size N | 2×W×(2×D+N) + 2×L×(D+N) |
| Multi-headed attention layer with sequence length L, inputs of size W, key of size D, head output of size N, output of size M and H attention heads | 2×H×(W×(2×D+N) + L×(D+N) + N×M) |
| Attention Is All You Need |
|---|
The input is a sequence of tokens, with an average length of 20 and a vocabulary size of 30,000. Each token is embedded and represented as a vector of size 1024. There are six encoder and decoder layers. Each encoder-decoder pair has a total of 3 multi-headed attention (MHA) sublayers, and 2 fully connected (FCN) sublayers. At the end there is a final linear layer and a softmax. Each MHA sublayer has the following parameters: input size W=64, head output size N=64, key size D=64, number of heads H=16, final output size M=1024. Hence each MHA sublayer has 2×16×(64×(2×64+64) + 20×(64+64) + 64×1024) = 2.6e6 FLOP per token. Each FCN sublayer has an input size of 1024, output size of 1024, and a single hidden layer with 4096 units. Hence each FCN sublayer has 2×2×1024×4096 = 1.7e7 FLOP per token. Summing all its layers, the encoder-decoder stack has 6 × (3 × 2.6e6 + 2 × 1.7e7) ~= 2.5e8 FLOP per token. The final linear layer has 2 × 1024 × 3e4 = 6.1e7 FLOP per token. Summing these, a forward pass takes 3.1e8 FLOP per token. The paper says they use batches of 25,000 tokens, and run the training for 300,000 steps. So the total training FLOP would be 2.5e4 × 3e5 × 3 × 3.1e8 = 6.97e18 FLOP. |
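The worked example above can be reproduced programmatically from the per-layer formulas in the table (a sketch; the layer dimensions are those stated in the example):

```python
def mha_flop_per_token(W, D, N, M, H, L):
    # Multi-headed attention: 2×H×(W×(2×D+N) + L×(D+N) + N×M)
    return 2 * H * (W * (2 * D + N) + L * (D + N) + N * M)

def fc_flop_per_token(n_in, n_out):
    # Fully connected layer from N to M neurons: 2×N×M
    return 2 * n_in * n_out

mha = mha_flop_per_token(W=64, D=64, N=64, M=1024, H=16, L=20)       # ≈ 2.6e6
fcn = fc_flop_per_token(1024, 4096) + fc_flop_per_token(4096, 1024)  # ≈ 1.7e7
stack = 6 * (3 * mha + 2 * fcn)                                      # ≈ 2.5e8
final_linear = fc_flop_per_token(1024, 30_000)                       # ≈ 6.1e7
forward = stack + final_linear                                       # ≈ 3.1e8 per token

# Total: tokens per batch × training steps × 3 (forward + backward) × forward FLOP.
total = 25_000 * 300_000 * 3 * forward
print(f"{forward:.1e}, {total:.1e}")  # 3.1e+08, 7.0e+18
```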
When details about model architecture, training data, hardware, and development time are scarce, it may be informative to compare the model’s performance on benchmarks to that of other models. Scaling laws can predict benchmark performance improvements against compute when scaling a given model family (for example coding performance for GPT-4 scaling and ARC Challenge for Llama-3). When there are differences in model/data/training, benchmark performance is less predictable from compute, but nevertheless remains correlated.
This process of estimating training compute from benchmark performance can be improved by aggregating performance across many benchmarks, especially when several or many models with known training compute have been evaluated on those benchmarks.
The procedure is roughly as follows: collect scores on shared benchmarks for models with known training compute, fit the relationship between aggregate benchmark performance and (log) training compute, and invert this fitted relationship to estimate the unknown model’s compute.
This process is demonstrated in a public Colab notebook, Compute Estimation from Benchmark Scores. Because these compute estimates are already based on benchmark performance, they should be excluded from analyses of the relationship between benchmarks and compute. Such compute estimates can be filtered using the Training compute estimation method field.
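A minimal illustration of the idea, using synthetic scores rather than real benchmark data: fit log-compute against an aggregate benchmark score for models with known compute, then invert the fit for a model with an unknown compute budget.

```python
import numpy as np

# Synthetic data: five models with known log10(training compute) and an
# aggregate benchmark score that trends with compute (illustrative only).
rng = np.random.default_rng(0)
log_compute = np.array([21.0, 22.0, 23.0, 24.0, 25.0])
score = 10 + 8 * (log_compute - 21) + rng.normal(0, 0.5, size=5)

# Least-squares fit: score ≈ a × log10(compute) + b.
a, b = np.polyfit(log_compute, score, deg=1)

# Invert the fit for a new model that scores 30 on the same aggregate.
estimated_log_compute = (30 - b) / a
print(round(estimated_log_compute, 1))  # roughly 23.5 under the synthetic trend
```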
Parameter counts are often reported by the model developer, but if the parameter count is not stated, it can sometimes be estimated from architectural details. Similar to estimating compute, estimating parameter count requires finding a description of the architecture, i.e. the type, number, and configuration of the layers, then calculating the parameters in each layer and summing them. Table 5 lists the parameter counts for different layers. Alternatively, if an implementation of the architecture is available, it can be simpler to instantiate the model in code and count its parameters directly.
| Layer | Parameters (approx) |
|---|---|
| Fully connected layer from N neurons to M neurons | N×M |
| CNN from a tensor of shape H×W×C with D filters of shape K×K×C, applied with stride S and padding P | D×K^2×C |
| Transpose CNN from a tensor of shape H×W×C with D filters of shape K×K×C, applied with stride S and padding P | D×K^2×C |
| RNN with bias vectors taking an input of size N and producing an output of size M | (N+M)×M |
| Fully gated GRU with bias vectors taking an input of size N and producing an output of size M | 3×(N+M)×M |
| LSTM with bias vectors taking an input of size N and producing an output of size M | 4×(N+M)×M |
| Word Embedding for vocabulary size V and embedding dimension W | W×V |
| Self attention layer with sequence length L, inputs of size W, key of size D and output of size N | W×(2×D+N) |
| Multi-headed attention layer with sequence length L, inputs of size W, key of size D, head output of size N, output of size M and H attention heads | H×(W×(2×D + N) + N×M) |
| Attention Is All You Need |
|---|
The input is a sequence of tokens, with an average length of 20 and a vocabulary size of 30,000. Each token is embedded and represented as a vector of size 1024. There are six encoder and decoder layers. Each encoder-decoder pair has a total of 3 multi-headed attention (MHA) sublayers, and 2 fully connected (FCN) sublayers. At the end there is a final linear layer and a softmax. Each MHA sublayer has the following parameters: input size W=64, head output size N=64, key size D=64, number of heads H=16, final output size M=1024. Hence each MHA sublayer has 16×(64×(2×64 + 64) + 64×1024) = 1.2e6 parameters. Each FCN layer has an input size of 1024, output size of 1024, and a single hidden layer with 4096 units. Hence each FCN layer has 2×1024×4096 = 8.4e6 parameters. Summing all its layers, the encoder-decoder stack has 6 × (3 × 1.2e6 + 2 × 8.4e6) ~= 1.2e8 parameters. The final linear layer has 1024 × 3e4 = 3.1e7 parameters. Two embedding layers each have 30e3 × 1024 parameters, so 6.2e7 in total. Summing these, the model has 2.1e8 parameters, matching the reported 213 million parameters in the paper. |
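The parameter count in this worked example can likewise be reproduced from the table’s formulas (a sketch; dimensions are those stated above, and biases are neglected as in the table):

```python
def mha_params(W, D, N, M, H):
    # Multi-headed attention: H×(W×(2×D+N) + N×M)
    return H * (W * (2 * D + N) + N * M)

def fc_params(n_in, n_out):
    # Fully connected layer from N neurons to M neurons: N×M
    return n_in * n_out

mha = mha_params(W=64, D=64, N=64, M=1024, H=16)     # ≈ 1.2e6
fcn = fc_params(1024, 4096) + fc_params(4096, 1024)  # ≈ 8.4e6
stack = 6 * (3 * mha + 2 * fcn)                      # ≈ 1.2e8
final_linear = fc_params(1024, 30_000)               # ≈ 3.1e7
embeddings = 2 * 30_000 * 1024                       # ≈ 6.1e7
total = stack + final_linear + embeddings
print(f"{total:.2e}")  # 2.15e+08 — close to the reported 213 million
```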
The field “Training power draw (W)” contains the power draw of the hardware used to train the model, measured in watts. This field is filled in when the training hardware type and quantity are known, and is calculated as follows:

Training power draw (W) = Hardware quantity × Hardware TDP × Server overhead × PUE

where:

- PUE (power usage effectiveness) = 1.23 for models published in 2008 or earlier; from 2009 onwards, PUE decays exponentially at a rate of ln(1.08/1.23)/16 ≈ 0.8% per year. The value is based on the Publication date of the model.
- Server overhead = 1 when Hardware quantity == 1, and 1.82 when Hardware quantity > 1.

This formula calculates the power per chip times the number of chips to get the peak power draw of the computing hardware, and then adjusts for the server hardware needed to connect the processors, and the power usage effectiveness of the facility containing the hardware.
The server overhead factor represents the power consumption of server hardware that is needed to connect multiple GPUs or TPUs to use them on the same computing task.
The value is derived from the NVIDIA DGX H100 server, comparing the power consumption of the server with GPUs to that of the GPUs alone: 10.2 kW server power consumption / (8 × 700 W H100 TDP) ≈ 1.82.
This server is chosen because it is representative of the hardware used to train modern notable machine learning models. As of May 2025, the geometric mean server power overhead for hardware used to train models identified in Epoch’s AI models dataset (weighted by number of notable models trained using each hardware type) was 1.79, very similar to the overhead of the DGX H100 server. The H100 also makes up the majority of total training compute of notable models in Epoch’s AI models dataset and the majority of the installed AI cluster compute capacity in Epoch’s AI supercomputers dataset; Hopper chips including the H100 constitute the majority of NVIDIA AI chip compute stock as of EOY 2024.
The power usage efficiency factor is the ratio of the total power consumed by a data center facility to the power delivered to computing equipment, and represents the additional energy overhead required to run processors and servers due to cooling and other non-computing power consumption. We select values of 1.23 in 2008 and before, 1.08 in 2025, and exponentially decaying at a rate of ln(1.08/1.23)/16 ≈ 0.8% per year from 2009 onwards, based on Google’s datacenter efficiency disclosures. Other hyperscalers, such as Meta, have similar datacenter efficiency. Non-AI datacenters tend to have lower efficiency, but the industry average follows a similar trend.
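The power draw estimate above can be sketched in a few lines. This is a minimal illustration, not the dataset's actual code: the function names are ours, and we assume the PUE decay is anchored so that it equals 1.23 through 2009 and 1.08 in 2025.

```python
import math

# Training power draw = quantity * chip TDP * server overhead * PUE,
# where the 1.82 server overhead applies only when more than one chip
# is used, and PUE decays exponentially from 1.23 (2008) to 1.08 (2025).

def pue(year: int) -> float:
    """Power usage effectiveness, by model publication year."""
    if year <= 2009:
        return 1.23
    rate = math.log(1.08 / 1.23) / 16   # approx. -0.8% per year
    return 1.23 * math.exp(rate * (year - 2009))

def training_power_draw_w(chip_tdp_w: float, quantity: int, year: int) -> float:
    """Estimated power draw of the training hardware, in watts."""
    server_overhead = 1.82 if quantity > 1 else 1.0
    return quantity * chip_tdp_w * server_overhead * pue(year)

# Example: a hypothetical cluster of 1,000 H100 GPUs (700 W TDP) in 2024.
print(round(training_power_draw_w(700, 1000, 2024)))
```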
To facilitate data collection, we use an LLM-based categorization pipeline. On a validation set of 192 models, the error rate was 4%. See here for methodological details.
As discussed in Records, the confidence statuses specify the following bounds as 90% confidence intervals:
Confidence applies to the recorded values for Training compute, Parameters, and Training dataset size. It describes confidence for the most uncertain of these values, among those that have a non-empty entry.
To estimate confidence statuses, we consider which parts of an estimate are uncertain, and how large the uncertainty is.
2024-06-19
The documentation was updated for the launch of the database on Epoch AI’s “Data on AI” webpage.
2025-05-02
Presentation of the dataset was improved in table view. Documentation was clarified for several fields.
2025-07-22
Documentation was updated to expand its scope. This documentation now covers all machine learning models recorded by Epoch AI. (Previously, only notable models were documented.)
We offer four downloads from the AI Models dataset. Notable AI models are those that meet our notability criteria, and are the subset we recommend for data analysis. Frontier models are models that were in the top 10 of training compute as of the time of their release. Large-scale models are models trained with at least 10²³ floating-point operations. All models shows every model in the dataset, including models that do not qualify for the above categories.
- CSV, updated November 26, 2025
- CSV, updated November 28, 2025
- CSV, updated November 10, 2025
- CSV, updated November 28, 2025
We would like to thank the authors of several sources where we have found one or more ML models to include in the database: Stanford CRFM’s foundation model ecosystem graph, AI Tracker, Stella Biderman’s directory of LLMs, Terry Um’s repo of deep learning papers, Alan Thompson’s models table, the OpenCompass Chinese LM leaderboard, the Akronomikon by LightOn AI, Papers With Code, the Metaculus 2022 AI Forecasting Database, Hugging Face, and Biology + AI Daily Papers. We would also like to thank the authors of AI and compute and Compute and Energy Consumption Trends in Deep Learning Inference.
The data have been collected by Epoch AI’s employees and collaborators, including Jaime Sevilla, Pablo Villalobos, Juan Felipe Cerón, Matthew Burtell, Lennart Heim, Amogh B. Nanjajjar, Tilman Rauker, Nuño Sempere, Max Rauker, Anson Ho, Tamay Besiroglu, Marius Hobbhahn, Jean-Stanislas Denain, Owen Dudney, David Atkinson, Ben Cottier, David Owen, Robi Rahman, Carl Guo, Josh You, Nicole Maug, Aidan O’Gara, Bartosz Podkanowicz, Luke Frymire, Natalia Martemianova, Lovis Heindrich, James Sanders, David Atanasov, Veronika Blablova, Amy Ngo, John Croxton, and Yafah Edelman.
This documentation was written by David Owen and Robi Rahman. Material on estimating compute, parameters and dataset sizes was adapted from previous documents by Jaime Sevilla, Lennart Heim, Marius Hobbhahn, Tamay Besiroglu, Anson Ho, Pablo Villalobos, and Robi Rahman.