Epoch’s AI Models Dataset is a collection of machine learning models useful for research about trends in the history and future of artificial intelligence. It includes over two thousand machine learning models, encompassing a broad range of domains and scales.
This documentation describes which models are contained within the database, the structure of its records (including data fields and definitions), and the processes for adding new entries and auditing accuracy. It also includes a changelog and acknowledgements.
The data contain several subsets, which can be viewed using the interactive visualization:

- Notable AI models are those that meet our notability criteria.
- Frontier models are models that were in the top 10 by training compute as of the time of their release.
- Large-scale models are models trained with at least 10^23 floating-point operations.
- All models shows every model in the dataset, including models that do not qualify for the above categories, such as models found during our research on algorithmic progress, AI for biology, or other investigations.
The dataset is available on our website as a visualization or table, and is available for download in CSV format, updated daily. For a quick-start example of loading the data and working with it in your research, see this Google Colab demo notebook.
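As a minimal sketch of loading and filtering the data with pandas (the local filename is an assumption, and the column names follow the field guide below — check them against the downloaded file):

```python
import pandas as pd

# To work with the daily-updated CSV, download it from the table view at
# https://epoch.ai/data/ai-models and load it, e.g.:
#   df = pd.read_csv("ai_models.csv")  # filename is an assumption
# For illustration, a tiny synthetic frame with column names matching the
# field guide below (values are examples, not the full dataset):
df = pd.DataFrame({
    "Model": ["Llama 2-70B", "Example small model"],
    "Publication date": ["2023-07-18", "2012-07-01"],
    "Training compute (FLOP)": [8.1e23, 4.7e17],
})
df["Publication date"] = pd.to_datetime(df["Publication date"])

# Large-scale models: trained with at least 1e23 FLOP.
large_scale = df[df["Training compute (FLOP)"] >= 1e23]
print(large_scale["Model"].tolist())  # ['Llama 2-70B']
```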
If you would like to ask any questions about the database, or suggest a model that should be added, contact us at data@epoch.ai.
If this dataset is useful for you, please cite it.
Epoch AI, ‘Parameter, Compute and Data Trends in Machine Learning’. Published online at epoch.ai. Retrieved from: ‘https://epoch.ai/data/ai-models’ [online resource]

```
@misc{epoch2022pcdtrends,
  title = "Parameter, Compute and Data Trends in Machine Learning",
  author = {{Epoch AI}},
  year = 2022,
  url = {https://epoch.ai/data/ai-models},
  note = "Accessed:"
}
```

The database covers AI models, especially models that are notable for advancing the state of the art, for having a large impact on the world, or for their significance in the history of the field. Here, we give an overview of how the data have been collected, and define the criteria for inclusion and notability.
To be included in the database, an ML model must satisfy all inclusion criteria:
Once added to the database, models are marked as notable if they satisfy any of the following:
Where there are many related models, for example several checkpoints along training or several sizes of a given model family, the database preferentially includes the version that used the most compute. Other versions may be included where they are notable in their own right.
Identifying whether a model is state-of-the-art can be a more involved process, compared to simply checking citations or the training compute budget. We consider a model to be state of the art if there is good reason to believe that it was the best existing model at the time for a task of genuine interest. The default way to provide evidence for this is state-of-the-art performance on a recognised benchmark.
To be recognised, a benchmark should have any of the following:
At our discretion, we may also identify models as state of the art where no benchmark result exists, but there is convincing evidence that a model truly is state-of-the-art. Eligible sources of evidence here are comparison on a non-benchmark database, a high-quality user preference study, or demonstration of state of the art capabilities. For example, GraphCast is compared against other weather prediction models on a weather database that is not a standalone benchmark. Nevertheless, we take this as convincing evidence that it is state of the art.
Models can be included on the grounds of historical significance if they marked a significant advance in AI history, even if they did not strictly advance the state of the art on any application. For example, many neural network breakthroughs performed worse than other ML techniques, but were directly influential for later AI development. Evidence to support this status may come from citations in later notable models, discussion in reviews or textbooks, or other unambiguous identification as an influential result.
Models can be included at the discretion of Epoch staff if they are as notable as the other models identified but not covered by the categories above. For example, we may mark a model as notable if it is on the Pareto frontier of cost-efficiency for an important task despite not having the highest performance on a benchmark.
| Example | Include? | Why |
|---|---|---|
| Human-level control through deep reinforcement learning | Yes | Well-documented learned model, over 5000 citations, advanced state of the art for autonomous gameplay. |
| Stochastic Neural Analog Reinforcement Calculator | Yes | No individual associated paper, but other sources confirm its existence, and it was indisputably historically significant as one of the first neural learning systems. |
| Theory of neural-analog reinforcement systems and its application to the brain model problem | No | Historically significant, but no experimentally trained model; the result is entirely theoretical. |
| Scaling scaling laws with board games | No | Doesn’t meet any notability criteria. In addition to not being highly cited and using small compute models, there is no attempt at state of the art results. Rather, this is a paper examining scaling details. |
This dataset has been collected from a variety of sources: literature reviews, historical accounts of AI development, highly-cited publications from top conferences, high-profile models from leading industry labs, bibliographies of notable papers, pre-existing datasets curating AI papers (see Acknowledgements), and ad hoc suggestions from contributors.
We monitor news coverage, releases from key AI labs, and benchmarks to identify new models as they are released. This can lead to a lag for new models. Typically, we aim to add the most prominent releases (e.g. GPT-4) within days of release. For less prominent models, reporting lags may extend to months.
As of November 28, 2025, the dataset contains 3204 models, of which 1368 have compute estimates.
The database focuses on information relevant to trends in AI model development. Records in the database have information about three broad areas:
Bibliographic information about the model and its associated publication, for example its title, URL, authors, citations, date, etc.
Training details such as training compute, parameters, dataset size, hardware used for training, etc.
Metadata about the record, such as notes on the above fields with supporting evidence and context, our confidence in the key estimates, etc.
We provide a comprehensive guide to the database’s fields below. This includes examples taken from Llama-2 70B, one of the best-documented recent models. If you would like to ask any questions about the database, or request a field that should be added, feel free to contact us at data@epoch.ai.
| Column | Type | Definition | Example from Llama 2-70B | Coverage |
|---|---|---|---|---|
| Abstract | Text | Abstract text from the publication associated with the model. | In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs. | 88% 2809 out of 3204 models |
| Authors | Text | Comma-separated list of authors. | Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom | 77% 2454 out of 3204 models |
| Base model | Categorical (single select) | Which base model the model was fine-tuned from, if applicable. | [empty] This is empty because Llama-2 was not finetuned from a base model. For a non-empty example, consider CodeLlama, a Llama-2 finetune. The base model would be Llama-2. | 21% 662 out of 3204 models |
| Batch size | Numeric | Batch size used during training. | 4000000 | 7% 229 out of 3204 models |
| Citations | Numeric | Number of citations as of last update. Values are collected from Semantic Scholar where available, otherwise manually from Google Scholar. | 13977 | 40% 1268 out of 3204 models |
| Confidence | Categorical (single select) | Metadata describing our confidence in the recorded values for Training compute, Parameters, and Training dataset size. This describes confidence for the most uncertain of these values, where they have a non-empty entry (compute is typically the most uncertain). | Confident | 100% 3204 out of 3204 models |
| Country (of organization) | Categorical (multiple select) | Country/countries associated with the developing organization(s). Multinational is used to mark organizations associated with multiple countries. | United States of America | 97% 3117 out of 3204 models |
| Domain | Categorical (multiple select) | The machine learning domain(s) of application associated with the model. This is fairly high-level, for example “Language” incorporates many different ML tasks. Possible values: 3D modeling, Astronomy, Audio, Biology, Cybersecurity, Driving, Earth science, Games, Image generation, Language, Materials science, Mathematics, Medicine, Multimodal, Other, Psychology, Recommendation, Robotics, Search, Speech, Video, Vision | Language | 100% 3204 out of 3204 models |
| Task | Categorical (multiple select) | The fine-grained task(s) that the model is designed to perform. These are specific applications of the model to different problems, and can span multiple domains. Task labels are assigned by following a flowchart: each applicable branch of the flowchart is followed until a leaf node is reached. If the task is already in the database, the model is tagged with that task; if the task does not yet exist in the database, the model is tagged with the new task and the task is added to the flowchart. | Language modeling, Language modeling/generation, Question answering | 96% 3086 out of 3204 models |
| Epochs | Numeric | The number of epochs (repetitions of the training dataset) used to train the model. | 1 | 24% 782 out of 3204 models |
| Finetune compute (FLOP) | Numeric | Compute used to fine-tune the model, if applicable. | [empty] | 8% 258 out of 3204 models |
| Hardware quantity | Numeric | Indicates the quantity of the hardware used in training, i.e. the number of chips. | 1000 | 26% 831 out of 3204 models |
| Hardware utilization (MFU) | Numeric | Number between 0.00 and 1.00 indicating the hardware utilization ratio, i.e. utilized FLOPs / theoretical maximum FLOPs. This value reflects utilization based on computations successfully applied to model training, and does not include computations performed by the hardware which do not ultimately affect the model. | 0.4191975017 | 2% 58 out of 3204 models |
| Hardware utilization (HFU) | Numeric | Number between 0.00 and 1.00 indicating the hardware utilization ratio, i.e. utilized FLOPs / theoretical maximum FLOPs. This value reflects utilization based on measured computational throughput in the hardware during training. (Model FLOPs utilization is a better measure of utilization if it is available.) | [empty] | 1% 24 out of 3204 models |
| Link | URL | Link(s) to best-choice sources documenting a model. This should preferentially be a journal or conference paper, preprint, or technical report. If these are not available, the links should point to other supporting evidence, such as an announcement post, a news article, or similar. | https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/ https://arxiv.org/abs/2307.09288 | 99% 3170 out of 3204 models |
| Model | Text | The name of the model. This should be unique within the database, and should be the best-known name for a given model. This column must be filled in, and is used as the primary key for indexing entries in the dataset. | Llama 2-70B | 100% 3204 out of 3204 models |
| Notability criteria | Categorical (multiple select) | The criteria met by the model which qualify it for notability. To be notable, a model must meet at least one criterion. Possible values are highly cited, large training cost, significant use, state of the art, or historical significance. These are discussed further in Inclusion. | Historical significance, Significant use, Highly cited, Training cost | 28% 905 out of 3204 models |
| Organization | Categorical (multiple select) | Organization(s) who created the model. Organizations may have multiple different names, but we aim to standardize organization names where they refer to the same organization. Therefore, organizations are periodically reviewed in Airtable and standardized to the most common name for them. For example, “University of California, Berkeley” and “Berkeley” have been changed to “UC Berkeley”. Note that some organizations have similar names but genuinely are different organizations, for example Google Brain versus Google versus Google DeepMind. | Meta AI | 97% 3123 out of 3204 models |
| Organization categorization | Categorical (multiple select) | Categorization of the organization(s), automatically populated from the Organization entry. Models are categorized as “Industry” if their authors are affiliated with private sector organizations, “Academia” if the authors are affiliated with universities or academic institutions, or “Industry - Academia Collaboration” when at least 30% of the authors are from each. Possible values: Industry, Research Collective, Academia, Industry - Academia Collaboration (Industry leaning), Industry - Academia Collaboration (Academia leaning), Non-profit | Industry | 97% 3105 out of 3204 models |
| Parameters | Numeric | Number of learnable parameters in the model. For neural networks, these are the weights and biases. Further information is provided in Estimation. | 7.0e10 | 65% 2070 out of 3204 models |
| Publication date | Date | The publication, announcement, or release date of the model, in YYYY-MM-DD format. If the year and month are known but the day is unknown, the day is filled in as YYYY-MM-15. If the year is known but the month and day are unknown, the month and day are filled in as YYYY-07-01. | 2023-07-18 | 99% 3186 out of 3204 models |
| Reference | Text | The literature reference for the model, such as the title of the journal or conference paper, academic preprint, or technical report. | Llama 2: Open Foundation and Fine-Tuned Chat Models | 95% 3047 out of 3204 models |
| Training compute (FLOP) | Numeric | Quantity of compute used to train the model, in FLOP. This is the total training compute for a given model, i.e. pretrain + finetune. It should be filled in here when directly reported, or calculated via GPU-hours or backpropagation gradient updates. Further guidance is provided in Estimation. | 8.1e23 | 43% 1368 out of 3204 models |
| Training compute cost (2023 USD) | Numeric | The training compute cost, estimated using the “amortized hardware capex plus energy” approach documented in ourtraining cost methodology. Values are converted to 2023 US dollars. | 1,102,561 | 7% 221 out of 3204 models |
| Training compute estimation method | Categorical (multiple select) | Indicates how the quantity of training compute was found or estimated, e.g. directly reported by the developer, estimated from hardware details and usage, estimated by counting operations, or estimated from benchmark performance. | Hardware, Operation counting | 45% 1428 out of 3204 models |
| Training hardware | Categorical (multiple select) | Type of training hardware used. Entries are cross-referenced against Epoch AI’s database of ML training hardware. | NVIDIA A100 SXM4 80 GB | 36% 1157 out of 3204 models |
| Training time (hours) | Numeric | Training time of the model, if reported. This refers to the time elapsed during the training process, not the number of chip-hours. For example, if a model were trained with 10 GPUs for 1 hour, the training time would be 1 hour. Includes the duration of all training phases conducted to develop the model, such as pre-training, post-training, RL, SFT, etc. If the model is fine-tuned from a previously published model, then that base model’s training time is not included. | 1728 | 17% 541 out of 3204 models |
| Training power draw (W) | Numeric | Power draw of the hardware used to train the model, in watts. Calculated as hardware quantity times processor TDP times datacenter PUE times server overhead. More details are provided in Estimating power draw. | 795557 | 24% 757 out of 3204 models |
| Frontier model | Boolean | Indicates whether a model was within the frontier, defined as models that were in the top 10 by training compute as of their release date. | False | 4% 131 out of 3204 models |
| Possibly over 1e23 FLOP | Boolean | Indicates whether a model was (or may have been) trained with at least 10^23 floating-point operations, which qualifies it for inclusion in the large-scale models dataset. | True | 15% 487 out of 3204 models |
| Model accessibility | Categorical (multiple select) | The accessibility of the model in terms of whether the model weights can be downloaded or, if the model weights are not accessible, whether the model can be used in an API or product. | Open weights (restricted use) | 77% 2457 out of 3204 models |
| Training code accessibility | Categorical (single select) | Denotes how the model can be accessed and used by the public. “Open weights (unrestricted)”, “Open weights (restricted use)” and “Open weights (non-commercial)” all mean that the model weights are downloadable by the public, but with different restrictions on use. “API access” means the model can only be interacted with via an application programming interface, and possibly also a hosted service. “Hosted access (no API)” means the model can only be interacted with via a hosted service. “Unreleased” means there is no way for the public to access the model. | Unreleased | 71% 2264 out of 3204 models |
| Notes fields, e.g. “Training compute notes” | Text | Metadata documenting the reasoning and/or evidence for a given column, e.g. training compute or dataset size. This is particularly important to note in cases where such information isn’t obvious. This field is unstructured text. | Training compute notes: "Pretraining utilized a cumulative 3.3M GPU hours of computation on hardware of type A100-80GB", of which 1720320 GPU hours were used to train the 70B model. 311.84 BF16 TFLOP/s * 1720320 hours * 0.40 utilization = 7.725e+23 FLOP. Alternatively: the model was trained for 1 epoch on 2 trillion tokens and has 70B parameters. C = 6ND = 6 * 70B * 2T = 8.4e+23 FLOP. | 50% 1593 out of 3204 models |
This section provides more information about recurring processes in the database: adding new models, updating citation counts, and updating the hosted files by which the dataset can be accessed for analysis.
Entries are added to the dataset near-daily, including both newly-released models and older models newly identified as notable. Typically, most information that can easily be determined from public information is added at the time a model is entered in the database. However, it is common for some information to gradually be entered later. For example, a compute estimate might be omitted at first and only added after we devote further effort to calculating it.
When models are added to the database, citation counts are recorded for those with academic publications or preprints. At the beginning of each month, citation counts are automatically updated for publications listed in Semantic Scholar. Publications not listed in Semantic Scholar rely on manual entry of citation count.
Epoch AI’s database is hosted as a CSV that is synced with the database daily. The easiest way to load the data in scripts is using the CSV URL. If you need the most up-to-date version reflecting unsynced changes, a CSV can be manually generated from the table view on the website.
Some fields within the database require estimation, because they are often not straightforwardly reported within papers or other sources. Here, we detail how estimation works for compute, model size, dataset size, and the metadata on estimate confidence.
Training compute is one of the most important pieces of information in our dataset, as reflected in its usage across Epoch AI’s research and elsewhere. However, estimating compute can be challenging. Here we outline how compute estimation is performed in the notable models dataset.
Compute is measured in units of floating-point operations (FLOP). For older models, sometimes the relevant operations were integer operations; in this case we report these instead. We do not apply any multiplier to adjust for operations potentially being more valuable under different tensor formats or precisions, for example FP16 versus FP32 or BF16. Some sources report compute in multiplication-and-addition operations, fused multiply-adds (FMAs), or similar. We treat one multiply-add/FMA as being equivalent to two FLOP to match typical reporting of chip performance.
For a given model in the database, training compute is provided as the total training compute, including pretraining, and including pretrained base models used as components. Finetuning compute is recorded in its associated column. Finetuning is distinguished by authors’ descriptions of the training as finetuning, or unambiguous use of a pretrained model in a distinct phase of training.
In the simplest case, training compute is directly reported in a paper, and we enter this figure into the database. When compute is not reported, we use two main methods to estimate it:
When there is enough information to count the operations, this is preferred in our dataset, because typically hardware-based estimates require assumptions about utilization, which may reduce the estimates’ accuracy.
Estimating compute from hardware details and usage is relatively straightforward, when the necessary details are known:
Once these details are known, the corresponding peak FLOP/s performance by hardware and number representation can be found from hardware documentation, or from the tool below. Finally, utilization rates account for real training runs falling significantly short of peak performance due to memory bottlenecks, network latency, etc. Typical utilization rates for large distributed training runs are around 30-50%. When these are not reported, they are estimated by reference to comparable models from a similar time period.
| ImageGPT |
|---|
Some training details are provided in the blog post: “[…] iGPT-L was trained for roughly 2500 V100-days […]” The number representation is not specified, but given this was trained by a major corporation in 2020, we assume the number format was FP16. The V100 has 125 TFLOP/s tensor FP16 performance. Assuming a utilization of 0.3, this leads to the following compute estimate: 8.1e21 FLOP = 2500 V100-days × 125e12 FLOP/s × 0.3 utilization × 86.4e3 s/day |
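The hardware-based method can be sketched as a one-line calculation; the figures below reproduce the ImageGPT estimate (the 125 TFLOP/s peak and 0.3 utilization are the assumptions stated in the example):

```python
def hardware_compute_estimate(chip_days: float, peak_flop_per_s: float,
                              utilization: float) -> float:
    """Training compute in FLOP: chip-seconds × peak FLOP/s × utilization."""
    return chip_days * 86_400 * peak_flop_per_s * utilization

# ImageGPT: 2500 V100-days at 125 TFLOP/s tensor FP16, assumed 0.3 utilization.
flop = hardware_compute_estimate(2500, 125e12, 0.3)
print(f"{flop:.1e}")  # 8.1e+21
```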
Counting the number of operations is often useful for older research, where hardware and usage details might be unavailable. A widely-applicable heuristic for the training compute of dense models is: Training compute ≈ (2 × # of connections) × 3 × (# of training examples) × (# of epochs). This works by first estimating the required FLOP for a forward pass, which is approximately twice the number of connections. This can be modified for sparsity such as Mixture-of-Experts: in this case, the heuristic should count only the connections in the active experts.
The forward pass FLOP is then multiplied by three to account for the backward pass, as the ratio between forward-pass and backward-pass FLOP is 1:2 for non-recurrent dense models. Finally, this is multiplied by the number of passes performed over the data: the number of training examples multiplied by the number of epochs the model was trained for. For transformer-based language models, this formula is equivalent to the commonly-used heuristic: Compute = 6 × # of parameters × # of training examples × # of epochs.
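As a sketch, the 6ND heuristic applied to Llama 2-70B’s reported figures (70B parameters, 2 trillion training tokens, 1 epoch) reproduces the operation-counting estimate quoted in the field guide above:

```python
def dense_training_compute(parameters: float, training_examples: float,
                           epochs: float = 1) -> float:
    """Compute = 6 × # of parameters × # of training examples × # of epochs."""
    return 6 * parameters * training_examples * epochs

# Llama 2-70B: 70e9 parameters, 2e12 tokens, 1 epoch.
print(f"{dense_training_compute(70e9, 2e12):.1e}")  # 8.4e+23
```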
Sometimes, the FLOP for a forward pass is reported directly in a paper. In this case, this value can be used directly instead of 2 × # of connections. Otherwise, the FLOP for a forward pass are evaluated by summing FLOP over the network’s layers. These are set out in Table 3.
| Layer | Forward pass FLOP per token (approx) |
|---|---|
| Fully connected layer from N neurons to M neurons | 2×N×M |
| CNN from a tensor of shape H×W×C with D filters of shape K×K×C, applied with stride S and padding P | 2×H×W×K^2×C×D/S^2 |
| Transpose CNN from a tensor of shape H×W×C with D filters of shape K×K×C, applied with stride S and padding P | 2×H×W×C×D×K^2 |
| RNN with bias vectors taking an input of size N and producing an output of size M | 2×(N+M)×M |
| Fully gated GRU with bias vectors taking an input of size N and producing an output of size M | 6×(N+M)×M |
| LSTM with bias vectors taking an input of size N and producing an output of size M | 8×(N+M)×M |
| Word Embedding for vocabulary size V and embedding dimension W | 0 |
| Self attention layer with sequence length L, inputs of size W, key of size D and output of size N | 2×W×(2×D+N) + 2×L×(D+N) |
| Multi-headed attention layer with sequence length L, inputs of size W, key of size D, head output of size N, output of size M and H attention heads | 2×H×(W×(2×D+N) + L×(D+N) + N×M) |
| Attention Is All You Need |
|---|
The input is a sequence of tokens, with an average length of 20 and a vocabulary size of 30,000. Each token is embedded and represented as a vector of size 1024. There are six encoder and decoder layers. Each encoder-decoder pair has a total of 3 multi-headed attention (MHA) sublayers, and 2 fully connected (FCN) sublayers. At the end there is a final linear layer and a softmax. Each MHA sublayer has the following parameters: input size W=64, head output size N=64, key size D=64, number of heads H=16, final output size M=1024. Hence each MHA sublayer has 2×16×(64×(2×64+64) + 20×(64+64) + 64×1024) = 2.6e6 FLOP per token. Each FCN sublayer has an input size of 1024, output size of 1024, and a single hidden layer with 4096 units. Hence each FCN sublayer has 2×2×1024×4096 = 1.7e7 FLOP per token. Summing all its layers, the encoder-decoder stack has 6 × (3 × 2.6e6 + 2 × 1.7e7) ~= 2.5e8 FLOP per token. The final linear layer has 2 × 1024 × 3e4 = 6.1e7 FLOP per token. Summing these, a forward pass takes 3.1e8 FLOP per token. The paper says they use batches of 25,000 tokens, and run the training for 300,000 steps. So the total training FLOP would be 2.5e4 × 3e5 × 3 × 3.1e8 = 6.97e18 FLOP. |
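The worked example above can be reproduced programmatically from the per-layer formulas in the table (a sketch; the layer dimensions are those stated in the example):

```python
def mha_flop_per_token(W, D, N, M, H, L):
    # Multi-headed attention: 2×H×(W×(2×D+N) + L×(D+N) + N×M)
    return 2 * H * (W * (2 * D + N) + L * (D + N) + N * M)

def fc_flop_per_token(n_in, n_out):
    # Fully connected layer from N to M neurons: 2×N×M
    return 2 * n_in * n_out

mha = mha_flop_per_token(W=64, D=64, N=64, M=1024, H=16, L=20)       # ≈ 2.6e6
fcn = fc_flop_per_token(1024, 4096) + fc_flop_per_token(4096, 1024)  # ≈ 1.7e7
stack = 6 * (3 * mha + 2 * fcn)                                      # ≈ 2.5e8
final_linear = fc_flop_per_token(1024, 30_000)                       # ≈ 6.1e7
forward = stack + final_linear                                       # ≈ 3.1e8 per token

# Total: tokens per batch × training steps × 3 (forward + backward) × forward FLOP.
total = 25_000 * 300_000 * 3 * forward
print(f"{forward:.1e}, {total:.1e}")  # 3.1e+08, 7.0e+18
```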
When details about model architecture, training data, hardware, and development time are scarce, it may be informative to compare the model’s performance on benchmarks to that of other models. Scaling laws can predict benchmark performance improvements against compute when scaling a given model family (for example coding performance for GPT-4 scaling and ARC Challenge for Llama-3). When there are differences in model/data/training, benchmark performance is less predictable from compute, but nevertheless remains correlated.
This process of estimating training compute from benchmark performance can be improved by aggregating performance across many benchmarks, especially when several or many models with known training compute have been evaluated on those benchmarks.
The procedure is roughly as follows: collect scores on shared benchmarks for models with known training compute, fit the relationship between aggregate benchmark performance and (log) training compute, and invert this fitted relationship to estimate the unknown model’s compute.
This process is demonstrated in a public Colab notebook, Compute Estimation from Benchmark Scores. Because these compute estimates are already based on benchmark performance, they should be excluded from analyses of the relationship between benchmarks and compute. Such compute estimates can be filtered using the Training compute estimation method field.
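A minimal illustration of the idea, using synthetic scores rather than real benchmark data: fit log-compute against an aggregate benchmark score for models with known compute, then invert the fit for a model with an unknown compute budget.

```python
import numpy as np

# Synthetic data: five models with known log10(training compute) and an
# aggregate benchmark score that trends with compute (illustrative only).
rng = np.random.default_rng(0)
log_compute = np.array([21.0, 22.0, 23.0, 24.0, 25.0])
score = 10 + 8 * (log_compute - 21) + rng.normal(0, 0.5, size=5)

# Least-squares fit: score ≈ a × log10(compute) + b.
a, b = np.polyfit(log_compute, score, deg=1)

# Invert the fit for a new model that scores 30 on the same aggregate.
estimated_log_compute = (30 - b) / a
print(round(estimated_log_compute, 1))  # roughly 23.5 under the synthetic trend
```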
Parameter counts are often reported by the model developer, but if the parameter count is not stated, it can sometimes be estimated from architectural details. Similar to estimating compute, estimating parameter count requires finding a description of the architecture, i.e. the type, number, and configuration of the layers, then calculating the parameters in each layer and summing them. Table 5 lists the parameter counts for different layers. Alternatively, if an implementation of the architecture is available, it can be simpler to instantiate the model in code and count its parameters directly.
| Layer | Parameters (approx) |
|---|---|
| Fully connected layer from N neurons to M neurons | N×M |
| CNN from a tensor of shape H×W×C with D filters of shape K×K×C, applied with stride S and padding P | D×K^2×C |
| Transpose CNN from a tensor of shape H×W×C with D filters of shape K×K×C, applied with stride S and padding P | D×K^2×C |
| RNN with bias vectors taking an input of size N and producing an output of size M | (N+M)×M |
| Fully gated GRU with bias vectors taking an input of size N and producing an output of size M | 3×(N+M)×M |
| LSTM with bias vectors taking an input of size N and producing an output of size M | 4×(N+M)×M |
| Word Embedding for vocabulary size V and embedding dimension W | W×V |
| Self attention layer with sequence length L, inputs of size W, key of size D and output of size N | W×(2×D+N) |
| Multi-headed attention layer with sequence length L, inputs of size W, key of size D, head output of size N, output of size M and H attention heads | H×(W×(2×D + N) + N×M) |
| Attention Is All You Need |
|---|
The input is a sequence of tokens, with an average length of 20 and a vocabulary size of 30,000. Each token is embedded and represented as a vector of size 1024. There are six encoder and decoder layers. Each encoder-decoder pair has a total of 3 multi-headed attention (MHA) sublayers, and 2 fully connected (FCN) sublayers. At the end there is a final linear layer and a softmax. Each MHA sublayer has the following parameters: input size W=64, head output size N=64, key size D=64, number of heads H=16, final output size M=1024. Hence each MHA sublayer has 16×(64×(2×64 + 64) + 64×1024) = 1.2e6 parameters. Each FCN layer has an input size of 1024, output size of 1024, and a single hidden layer with 4096 units. Hence each FCN layer has 2×1024×4096 = 8.4e6 parameters. Summing all its layers, the encoder-decoder stack has 6 × (3 × 1.2e6 + 2 × 8.4e6) ~= 1.2e8 parameters. The final linear layer has 1024 × 3e4 = 3.1e7 parameters. Two embedding layers each have 30e3 × 1024 parameters, so 6.2e7 in total. Summing these, the model has 2.1e8 parameters, matching the reported 213 million parameters in the paper. |
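The parameter count in this worked example can likewise be reproduced from the table’s formulas (a sketch; dimensions are those stated above, and biases are neglected as in the table):

```python
def mha_params(W, D, N, M, H):
    # Multi-headed attention: H×(W×(2×D+N) + N×M)
    return H * (W * (2 * D + N) + N * M)

def fc_params(n_in, n_out):
    # Fully connected layer from N neurons to M neurons: N×M
    return n_in * n_out

mha = mha_params(W=64, D=64, N=64, M=1024, H=16)     # ≈ 1.2e6
fcn = fc_params(1024, 4096) + fc_params(4096, 1024)  # ≈ 8.4e6
stack = 6 * (3 * mha + 2 * fcn)                      # ≈ 1.2e8
final_linear = fc_params(1024, 30_000)               # ≈ 3.1e7
embeddings = 2 * 30_000 * 1024                       # ≈ 6.1e7
total = stack + final_linear + embeddings
print(f"{total:.2e}")  # 2.15e+08 — close to the reported 213 million
```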
The field “Training power draw (W)” contains the power draw of the hardware used to train the model, measured in watts. This field is filled in when the training hardware type and quantity are known, and is calculated as follows:

Training power draw (W) = Hardware quantity × Hardware TDP × Server overhead × PUE

where:

- PUE (power usage effectiveness) = 1.23 for models published in 2008 or earlier; from 2009 onwards, PUE decays exponentially at a rate of ln(1.08/1.23)/16 ≈ 0.8% per year. The value is based on the Publication date of the model.
- Server overhead = 1 when Hardware quantity == 1, and 1.82 when Hardware quantity > 1.

This formula calculates the power per chip times the number of chips to get the peak power draw of the computing hardware, and then adjusts for the server hardware needed to connect the processors, and the power usage effectiveness of the facility containing the hardware.
The server overhead factor represents the power consumption of server hardware that is needed to connect multiple GPUs or TPUs to use them on the same computing task.
The value is derived from the NVIDIA DGX H100 server, comparing the power consumption of the server with GPUs to that of the GPUs alone: 10.2 kW server power consumption / (8 × 700 W H100 TDP) ≈ 1.82.
This server is chosen because it is representative of the hardware used to train modern notable machine learning models. As of May 2025, the geometric mean server power overhead for hardware used to train models identified in Epoch’s AI models dataset (weighted by number of notable models trained using each hardware type) was 1.79, very similar to the overhead of the DGX H100 server. The H100 also makes up the majority of total training compute of notable models in Epoch’s AI models dataset and the majority of the installed AI cluster compute capacity in Epoch’s AI supercomputers dataset; Hopper chips including the H100 constitute the majority of NVIDIA AI chip compute stock as of EOY 2024.
The power usage efficiency factor is the ratio of the total power consumed by a data center facility to the power delivered to computing equipment, and represents the additional energy overhead required to run processors and servers due to cooling and other non-computing power consumption. We select values of 1.23 in 2008 and before, 1.08 in 2025, and exponentially decaying at a rate of ln(1.08/1.23)/16 ≈ 0.8% per year from 2009 onwards, based on Google’s datacenter efficiency disclosures. Other hyperscalers, such as Meta, have similar datacenter efficiency. Non-AI datacenters tend to have lower efficiency, but the industry average follows a similar trend.
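The power draw estimate above can be sketched in a few lines. This is a minimal illustration, not the dataset's actual code: the function names are ours, and we assume the PUE decay is anchored so that it equals 1.23 through 2009 and 1.08 in 2025.

```python
import math

# Training power draw = quantity * chip TDP * server overhead * PUE,
# where the 1.82 server overhead applies only when more than one chip
# is used, and PUE decays exponentially from 1.23 (2008) to 1.08 (2025).

def pue(year: int) -> float:
    """Power usage effectiveness, by model publication year."""
    if year <= 2009:
        return 1.23
    rate = math.log(1.08 / 1.23) / 16   # approx. -0.8% per year
    return 1.23 * math.exp(rate * (year - 2009))

def training_power_draw_w(chip_tdp_w: float, quantity: int, year: int) -> float:
    """Estimated power draw of the training hardware, in watts."""
    server_overhead = 1.82 if quantity > 1 else 1.0
    return quantity * chip_tdp_w * server_overhead * pue(year)

# Example: a hypothetical cluster of 1,000 H100 GPUs (700 W TDP) in 2024.
print(round(training_power_draw_w(700, 1000, 2024)))
```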
To facilitate data collection, we use an LLM-based categorization pipeline. On a validation set of 192 models, the error rate was 4%. See here for methodological details.
As discussed in Records, the confidence statuses specify the following bounds as 90% confidence intervals:
Confidence applies to the recorded values for Training compute, Parameters, and Training dataset size. It describes confidence for the most uncertain of these values, among those that have a non-empty entry.
To estimate confidence statuses, we consider which parts of an estimate are uncertain, and how large the uncertainty is.
2024-06-19
The documentation was updated for the launch of the database on Epoch AI’s “Data on AI” webpage.
2025-05-02
Presentation of the dataset was improved in table view. Documentation was clarified for several fields.
2025-07-22
Documentation was updated to expand its scope. This documentation now covers all machine learning models recorded by Epoch AI. (Previously, only notable models were documented.)
We offer four downloads from the AI Models dataset. Notable AI models are those that meet our notability criteria, and are the subset we recommend for data analysis. Frontier models are models that were in the top 10 of training compute as of the time of their release. Large-scale models are models trained with at least 10²³ floating-point operations. All models shows every model in the dataset, including models that do not qualify for the above categories.
- CSV, updated November 26, 2025
- CSV, updated November 28, 2025
- CSV, updated November 10, 2025
- CSV, updated November 28, 2025
We would like to thank the authors of several sources where we have found one or more ML models to include in the database: Stanford CRFM’s foundation model ecosystem graph, AI Tracker, Stella Biderman’s directory of LLMs, Terry Um’s repo of deep learning papers, Alan Thompson’s models table, the OpenCompass Chinese LM leaderboard, the Akronomikon by LightOn AI, Papers With Code, the Metaculus 2022 AI Forecasting Database, Hugging Face, and Biology + AI Daily Papers. We would also like to thank the authors of AI and compute and Compute and Energy Consumption Trends in Deep Learning Inference.
The data have been collected by Epoch AI’s employees and collaborators, including Jaime Sevilla, Pablo Villalobos, Juan Felipe Cerón, Matthew Burtell, Lennart Heim, Amogh B. Nanjajjar, Tilman Rauker, Nuño Sempere, Max Rauker, Anson Ho, Tamay Besiroglu, Marius Hobbhahn, Jean-Stanislas Denain, Owen Dudney, David Atkinson, Ben Cottier, David Owen, Robi Rahman, Carl Guo, Josh You, Nicole Maug, Aidan O’Gara, Bartosz Podkanowicz, Luke Frymire, Natalia Martemianova, Lovis Heindrich, James Sanders, David Atanasov, Veronika Blablova, Amy Ngo, John Croxton, and Yafah Edelman.
This documentation was written by David Owen and Robi Rahman. Material on estimating compute, parameters and dataset sizes was adapted from previous documents by Jaime Sevilla, Lennart Heim, Marius Hobbhahn, Tamay Besiroglu, Anson Ho, Pablo Villalobos, and Robi Rahman.