| Tensor Processing Unit 3.0 | |
|---|---|
| Designer | Google |
| Introduced | 2015[1] |
| Type | Neural network machine learning ASIC |
Tensor Processing Unit (TPU) is an AI accelerator application-specific integrated circuit (ASIC) developed by Google for neural network machine learning, using Google's own TensorFlow software.[2] Google began using TPUs internally in 2015, and in 2018 made them available for third-party use, both as part of its cloud infrastructure and by offering a smaller version of the chip for sale.
Compared to a graphics processing unit, TPUs are designed for a high volume of low-precision computation (e.g. as little as 8-bit precision)[3] with more input/output operations per joule, without hardware for rasterisation/texture mapping.[4] The TPU ASICs are mounted in a heatsink assembly, which can fit in a hard drive slot within a data center rack, according to Norman Jouppi.[5]
Different types of processors are suited for different types of machine learning models. TPUs are well suited for CNNs, while GPUs have benefits for some fully connected neural networks, and CPUs can have advantages for RNNs.[6]
According to Jonathan Ross, one of the original TPU engineers[1] and later the founder of Groq, three separate groups at Google were developing AI accelerators, with the TPU being the design that was ultimately selected. He was not aware of systolic arrays at the time, and upon learning the term thought, "Oh, that's called a systolic array? It just seemed to make sense."[7]
The tensor processing unit was announced in May 2016 at Google I/O, when the company said that the TPU had already been used inside their data centers for over a year.[5][4] Google's 2017 paper describing its creation cites previous systolic matrix multipliers of similar architecture built in the 1990s.[8] The chip has been specifically designed for Google's TensorFlow framework, a symbolic math library which is used for machine learning applications such as neural networks.[9] However, as of 2017 Google still used CPUs and GPUs for other types of machine learning.[5] Other AI accelerator designs are appearing from other vendors as well, aimed at the embedded and robotics markets.
Google's TPUs are proprietary. Some models are commercially available, and on February 12, 2018, The New York Times reported that Google "would allow other companies to buy access to those chips through its cloud-computing service."[10] Google has said that they were used in the AlphaGo versus Lee Sedol series of human-versus-machine Go games,[4] as well as in the AlphaZero system, which produced Chess, Shogi and Go playing programs from the game rules alone and went on to beat the leading programs in those games.[11] Google has also used TPUs for Google Street View text processing, and was able to find all the text in the Street View database in less than five days. In Google Photos, an individual TPU can process over 100 million photos a day.[5] It is also used in RankBrain, which Google uses to provide search results.[12]
Google provides third parties access to TPUs through its Cloud TPU service as part of the Google Cloud Platform[13] and through its notebook-based services Kaggle and Colaboratory.[14][15]
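For instance, in a Colab or Kaggle notebook with a TPU runtime selected, the attached TPU cores can be listed from Python. A minimal check, assuming JAX with TPU support is installed (as it is in Colab's TPU runtime):

```python
import jax

# Lists the accelerator cores visible to the runtime; on a TPU runtime
# this prints TPU device entries, on a plain VM it falls back to CPU.
print(jax.devices())
```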
Broadcom is a co-developer of TPUs, translating Google's architecture and specifications into manufacturable silicon. It provides proprietary technologies such as SerDes high-speed interfaces, oversees ASIC design, and manages chip fabrication and packaging through third-party foundries like TSMC, covering all generations since the program's inception.[16][17][18]
| | v1 | v2 | v3 | v4[20][22][23] | v5e[24] | v5p[25][26] | v6e (Trillium)[27][28] | v7 (Ironwood)[29] |
|---|---|---|---|---|---|---|---|---|
| Date introduced | 2015 | 2017 | 2018 | 2021 | 2023 | 2023 | 2024 | 2025 |
| Process node | 28 nm | 16 nm | 16 nm | 7 nm | Not listed | Not listed | Not listed | Not listed |
| Die size (mm²) | 331 | < 625 | < 700 | < 400 | 300–350 | Not listed | Not listed | Not listed |
| On-chip memory (MiB) | 28 | 32 | 32 (VMEM) + 5 (spMEM) | 128 (CMEM) + 32 (VMEM) + 10 (spMEM) | Not listed | Not listed | Not listed | Not listed |
| Clock speed (MHz) | 700 | 700 | 940 | 1050 | Not listed | 1750 | Not listed | Not listed |
| Memory | 8 GiB DDR3 | 16 GiB HBM | 32 GiB HBM | 32 GiB HBM | 16 GB HBM | 95 GB HBM | 32 GB HBM | 192 GB HBM |
| Memory bandwidth | 34 GB/s | 600 GB/s | 900 GB/s | 1200 GB/s | 819 GB/s | 2765 GB/s | 1640 GB/s | 7.37 TB/s |
| Thermal design power (W) | 75 | 280 | 220 | 170 | Not listed | Not listed | Not listed | Not listed |
| Computational performance (trillion operations per second) | 23 | 45 | 123 | 275 | 197 (bf16) 393 (int8) | 459 (bf16) 918 (int8) | 918 (bf16) 1836 (int8) | 4614 (fp8) |
| Energy efficiency (teraOPS/W) | 0.31 | 0.16 | 0.56 | 1.62 | Not listed | Not listed | Not listed | 4.7 |
The first-generation TPU is an 8-bit matrix multiplication engine, driven with CISC instructions by the host processor across a PCIe 3.0 bus. It is manufactured on a 28 nm process with a die size ≤ 331 mm². The clock speed is 700 MHz and it has a thermal design power of 28–40 W. It has 28 MiB of on-chip memory, and 4 MiB of 32-bit accumulators taking the results of a 256×256 systolic array of 8-bit multipliers.[8] Within the TPU package is 8 GiB of dual-channel 2133 MHz DDR3 SDRAM offering 34 GB/s of bandwidth.[21] Instructions transfer data to or from the host, perform matrix multiplications or convolutions, and apply activation functions.[8]
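To make that dataflow concrete, the following is a toy, time-stepped NumPy simulation of an output-stationary systolic array multiplying 8-bit matrices into 32-bit accumulators. It is a sketch of the general technique only, not Google's hardware design; the function name and the small matrix sizes are invented for the example.

```python
import numpy as np

def systolic_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Toy simulation of an output-stationary systolic array computing A @ B.

    Row i of A streams in from the left and column j of B from the top,
    each skewed by one time step per position; the processing element (PE)
    at (i, j) multiplies the int8 pair passing through it and adds the
    product to its local int32 accumulator.
    """
    n, m = a.shape
    m2, p = b.shape
    assert m == m2 and a.dtype == np.int8 and b.dtype == np.int8
    acc = np.zeros((n, p), dtype=np.int32)  # 32-bit accumulators
    # At time step t, PE (i, j) processes a[i, k] and b[k, j] with k = t - i - j,
    # so each PE performs at most one multiply-accumulate per step.
    for t in range(n + p + m - 2):
        for i in range(n):
            for j in range(p):
                k = t - i - j
                if 0 <= k < m:
                    acc[i, j] += np.int32(a[i, k]) * np.int32(b[k, j])
    return acc

rng = np.random.default_rng(0)
x = rng.integers(-128, 128, size=(8, 8), dtype=np.int8)  # activations
w = rng.integers(-128, 128, size=(8, 8), dtype=np.int8)  # weights
assert np.array_equal(systolic_matmul(x, w), x.astype(np.int32) @ w.astype(np.int32))
```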
The second-generation TPU was announced in May 2017.[30] Google stated the first-generation TPU design was limited by memory bandwidth, and that using 16 GB of High Bandwidth Memory in the second-generation design increased bandwidth to 600 GB/s and performance to 45 teraFLOPS.[21] The TPUs are then arranged into four-chip modules with a performance of 180 teraFLOPS.[30] Then 64 of these modules are assembled into 256-chip pods with 11.5 petaFLOPS of performance.[30] Notably, while the first-generation TPUs were limited to integers, the second-generation TPUs can also calculate in floating point, introducing the bfloat16 format invented by Google Brain. This makes the second-generation TPUs useful for both training and inference of machine learning models. Google has stated these second-generation TPUs will be available on the Google Compute Engine for use in TensorFlow applications.[31]
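bfloat16 keeps float32's 8 exponent bits, and therefore its dynamic range, but truncates the mantissa to 7 bits, so a bfloat16 value can be obtained from a float32 by keeping only its upper 16 bits. Below is a minimal NumPy sketch of that conversion, assuming simple truncation rather than the round-to-nearest behavior real hardware typically applies:

```python
import numpy as np

def to_bfloat16_bits(x: float) -> np.uint16:
    """Truncate a float32 to its upper 16 bits: the bfloat16 bit pattern."""
    return np.uint16(np.float32(x).view(np.uint32) >> np.uint32(16))

def from_bfloat16_bits(b: np.uint16) -> np.float32:
    """Re-expand a bfloat16 bit pattern into a float32 with zeroed low bits."""
    return (np.uint32(b) << np.uint32(16)).view(np.float32)

x = np.float32(3.14159265)
print(from_bfloat16_bits(to_bfloat16_bits(x)))  # ~3.140625: range kept, precision cut
```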
The third-generation TPU was announced on May 8, 2018.[32] Google announced that the processors themselves were twice as powerful as the second-generation TPUs, and would be deployed in pods with four times as many chips as the preceding generation.[33][34] This results in an 8-fold increase in performance per pod (with up to 1,024 chips per pod) compared to the second-generation TPU deployment.
On May 18, 2021, Google CEO Sundar Pichai spoke about TPU v4 Tensor Processing Units during his keynote at the Google I/O virtual conference. TPU v4 improved performance by more than 2x over TPU v3 chips. Pichai said, "A single v4 pod contains 4,096 v4 chips, and each pod has 10x the interconnect bandwidth per chip at scale, compared to any other networking technology."[35] An April 2023 paper by Google claims TPU v4 is 5–87% faster than an Nvidia A100 at machine learning benchmarks.[36]
There is also an "inference" version, called v4i,[37] that does not require liquid cooling.[38]
In 2021, Google revealed that the physical layout of TPU v5 was being designed with the assistance of a novel application of deep reinforcement learning.[39] Google claims TPU v5 is nearly twice as fast as TPU v4,[40] and based on that and the relative performance of TPU v4 over the A100, some speculate that TPU v5 is as fast as or faster than an H100.[41]
Similar to the v4i being a lighter-weight version of the v4, the fifth generation has a "cost-efficient"[42] version called v5e.[24] In December 2023, Google announced TPU v5p, which is claimed to be competitive with the H100.[43]
In May 2024, at the Google I/O conference, Google announced TPU v6, which became available in preview in October 2024.[44] Google claimed a 4.7 times performance increase relative to TPU v5e,[45] via larger matrix multiplication units and an increased clock speed. High Bandwidth Memory (HBM) capacity and bandwidth have also doubled. A pod can contain up to 256 Trillium units.[46]
In April 2025, at the Google Cloud Next conference, Google unveiled TPU v7. The new chip, called Ironwood, will come in two configurations: a 256-chip cluster and a 9,216-chip cluster. Ironwood will have a peak computational performance of 4,614 TFLOP/s.[47]
In July 2018, Google announced the Edge TPU. The Edge TPU is Google's purpose-built ASIC designed to run machine learning (ML) models for edge computing, meaning it is much smaller and consumes far less power than the TPUs hosted in Google data centers (also known as Cloud TPUs[48]). In January 2019, Google made the Edge TPU available to developers with a line of products under the Coral brand. The Edge TPU is capable of 4 trillion operations per second with 2 W of electrical power.[49]
The product offerings include a single-board computer (SBC), a system on module (SoM), a USB accessory, a mini PCI-e card, and an M.2 card. The SBC Coral Dev Board and Coral SoM both run Mendel Linux OS, a derivative of Debian.[50][51] The USB, PCI-e, and M.2 products function as add-ons to existing computer systems, and support Debian-based Linux systems on x86-64 and ARM64 hosts (including Raspberry Pi).
The machine learning runtime used to execute models on the Edge TPU is based on TensorFlow Lite.[52] The Edge TPU is only capable of accelerating forward-pass operations, which means it is primarily useful for performing inference (although it is possible to perform lightweight transfer learning on the Edge TPU[53]). The Edge TPU also only supports 8-bit math, so for a network to be compatible with the Edge TPU, it needs to either be trained using TensorFlow's quantization-aware training technique or, since late 2019, be converted using post-training quantization.
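As a rough sketch of the second path, the following shows TensorFlow Lite's post-training full-integer quantization flow. The model directory and the representative dataset generator are hypothetical placeholders, and exact options may vary across TensorFlow versions:

```python
import numpy as np
import tensorflow as tf

# Hypothetical representative dataset: a few input samples the converter
# runs through the model to calibrate the int8 quantization ranges.
def representative_data_gen():
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("my_model_dir")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Require full-integer ops so the resulting model can run on the Edge TPU.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
```

The quantized .tflite file is then compiled for the device with Google's Edge TPU Compiler before deployment.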
On November 12, 2019, Asus announced a pair of single-board computers (SBCs) featuring the Edge TPU. The Asus Tinker Edge T and Tinker Edge R boards are designed for IoT and edge AI. The SBCs officially support the Android and Debian operating systems.[54][55] ASUS has also demonstrated a mini PC called the Asus PN60T featuring the Edge TPU.[56]
On January 2, 2020, Google announced the Coral Accelerator Module and Coral Dev Board Mini, to be demonstrated at CES 2020 later the same month. The Coral Accelerator Module is a multi-chip module featuring the Edge TPU, with PCIe and USB interfaces for easier integration. The Coral Dev Board Mini is a smaller SBC featuring the Coral Accelerator Module and a MediaTek 8167s SoC.[57][58]
On October 15, 2019, Google announced the Pixel 4 smartphone, which contains an Edge TPU called the Pixel Neural Core. Google describes it as "customized to meet the requirements of key camera features in Pixel 4", using a neural network search that sacrifices some accuracy in favor of minimizing latency and power use.[59]
Google followed the Pixel Neural Core by integrating an Edge TPU into a custom system-on-chip named Google Tensor, which was released in 2021 with the Pixel 6 line of smartphones.[60] The Google Tensor SoC demonstrated "extremely large performance advantages over the competition" in machine learning-focused benchmarks; although instantaneous power consumption was also relatively high, the improved performance meant less energy was consumed due to shorter periods requiring peak performance.[61]
In 2019, Singular Computing, founded in 2009 by Joseph Bates, a visiting professor at MIT,[62] filed suit against Google alleging patent infringement in TPU chips.[63] By 2020, Google had successfully lowered the number of claims the court would consider to just two: claim 53 of US 8407273, filed in 2012, and claim 7 of US 9218156, filed in 2013, both of which claim a dynamic range of 10⁻⁶ to 10⁶ for floating point numbers, which the standard float16 cannot achieve (without resorting to subnormal numbers), as it only has five bits for the exponent. In a 2023 court filing, Singular Computing specifically called out Google's use of bfloat16, as that exceeds the dynamic range of float16.[64] Singular claims non-standard floating point formats were non-obvious in 2009, but Google retorts that the VFLOAT[65] format, with a configurable number of exponent bits, existed as prior art in 2002.[66] By January 2024, subsequent lawsuits by Singular had brought the number of patents being litigated up to eight. Towards the end of the trial later that month, Google agreed to a settlement with undisclosed terms.[67][68]
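The range claim is easy to verify numerically; a quick NumPy illustration (float16 here is the IEEE half-precision format the filing contrasts against):

```python
import numpy as np

# float16 has 5 exponent bits: its normal values span roughly 6.1e-5 to 65504,
# so 1e-6 only survives as a subnormal and 1e6 overflows to infinity.
print(np.float16(1e-6))           # ~1.0133e-06, a subnormal with few significant bits
print(np.float16(1e6))            # inf: beyond float16's maximum
print(np.finfo(np.float16).tiny)  # 6.104e-05, smallest normal float16
print(np.finfo(np.float16).max)   # 65504.0, largest float16
# bfloat16's 8 exponent bits match float32's, covering roughly 1e-38 to 3e38.
```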