A tensor processing unit (TPU) is an application-specific integrated circuit (ASIC) designed specifically to accelerate the high-volume mathematical and logical processing typically involved in machine learning (ML) workloads.
Google designed the tensor ASIC, using TPUs for in-house neural network ML projects as early as 2015 alongside its custom TensorFlow software. Google released the TPU for third-party use in 2018. Today, the evolving TPU chips and the TensorFlow software framework are mainstays of ML infrastructure, including on Google Cloud Platform (GCP).
TPUs provide a limited number of features and functions that are directly useful for ML and artificial intelligence (AI) tasks but not necessarily for everyday general computing. ML models and the AI platforms that use them, such as deep learning and neural networks, require extensive mathematical processing. While it's possible to execute these tasks on ordinary central processing units (CPUs) or more advanced graphics processing units (GPUs), neither is optimized for them.
Just as GPUs arose to speed the math processing required for gaming and data visualization, TPUs now accelerate the mathematical tasks used for neural networks and other ML models, chiefly multiply and accumulate (addition) operations.
A TPU employs one or more extensive arrays of multiply-and-accumulate arithmetic logic units (ALUs) configured as a matrix. This matrix processing solves extensive mathematical tasks much faster, and with far lower power consumption, than more traditional processor types. In short, a TPU takes input data, breaks down the data into multiple tasks called vectors, performs multiplication and addition on each vector simultaneously and in parallel, and then delivers the output to the ML model.
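For intuition, the core operation a TPU's matrix hardware accelerates is the multiply-accumulate chain at the heart of a matrix multiplication. The following is a minimal, illustrative sketch in Python with NumPy, not TPU code; the array shapes are arbitrary assumptions chosen for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer input and weights; shapes are arbitrary for illustration.
inputs = rng.random((8, 16)).astype(np.float32)   # 8 input vectors, 16 features each
weights = rng.random((16, 4)).astype(np.float32)  # weights mapping 16 features to 4 outputs

# Naive view: each output element is a chain of multiply-accumulate steps.
out_naive = np.zeros((8, 4), dtype=np.float32)
for i in range(8):
    for j in range(4):
        acc = 0.0
        for k in range(16):
            acc += inputs[i, k] * weights[k, j]   # multiply, then accumulate
        out_naive[i, j] = acc

# A TPU's matrix unit performs all of those multiply-accumulates
# in parallel, as one matrix product.
out_matrix = inputs @ weights

assert np.allclose(out_naive, out_matrix, atol=1e-5)
```

The speedup comes from replacing that serial triple loop with a hardware array that computes every multiply-accumulate of the product at once.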
Recent TPU designs automatically adjust performance to the type of application they support. TPUs also handle low-level dataflow graphs and tackle sophisticated graph calculations that tax traditional CPUs and GPUs. TPUs support 16-bit floating point (bfloat16) operations and use high-bandwidth memory; late-model TPUv5p chips list a memory bandwidth of 2,765 GBps.
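To make the 16-bit point concrete, the sketch below demonstrates the reduced-precision bfloat16 type in TensorFlow. It assumes TensorFlow is installed and only illustrates the data type and the usual mixed-precision setting; it does not require, or demonstrate, actual TPU execution:

```python
import tensorflow as tf

# bfloat16 keeps float32's 8-bit exponent range but truncates the mantissa,
# trading precision for memory bandwidth and silicon area.
x = tf.constant([1.0, 3.14159265, 1e-8], dtype=tf.bfloat16)
print(x)  # values rounded to bfloat16 precision

# On TPUs, Keras mixed precision typically computes in bfloat16
# while keeping model variables in float32 for numerical stability.
tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")
```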
Every processor does the same fundamental job: execute a set of instructions that move and operate on data. Performing a job directly in hardware increases speed; a processor typically completes a hardware-supported task within a few billionths of a second, which is fast and efficient.
However, if a processor is not designed or optimized to perform a certain task, that task becomes difficult, even impossible, to perform directly. Instead, software must use the processor's available instruction set to emulate the intended function. Unfortunately, software emulation almost always yields poor, inefficient performance because the processor needs far more time to do far more work.
Virtualization, for example, requires continuous translation between physical and virtual hardware resources. Early virtualization software used emulation to process these translations, severely limiting the performance and the number of virtual machines (VMs) a computer could support. When processor designers added virtualization instruction sets to modern processors, performance improved dramatically, allowing computers to handle many VMs simultaneously at near-native speeds. Processors are often tailored and updated in this way to handle new processing problems.
Conversely, a processor is sometimes selected for its simplicity or suitability. Consider an automatic coffee maker. While programmable, it typically needs only a small subset of processor-type instructions to function, so a general-purpose processor would be wasteful and expensive. An ASIC instead provides a stripped-down chip with faster performance and far lower power demands for that one job.
Ultimately, the correct CPU, GPU or TPU is the one that's best suited for the computing problem at hand.
The CPU is a general-purpose device designed to support more than 1,500 different instructions in hardware, or on chip. Several processing cores might be incorporated into the same processor package that plugs into the computer's motherboard.
CPUs process instructions and data one at a time along an internal pipeline. This speeds up individual operations but limits the number of simultaneous ones. CPUs can indeed support many ML models and are best applied when the model has the following properties:
- It requires maximum flexibility, such as for quick prototyping.
- It is simple and trains quickly.
- It is small, with small effective batch sizes.
- It is dominated by custom operations written for the CPU.
- It is limited by available I/O or network bandwidth.
The GPU provides high levels of parallel processing and supports detailed mathematical tasks that general-purpose CPUs cannot handle without emulation. Such characteristics are typically useful for visualization applications, including computer games, math-intensive software, and three-dimensional (3D) and other rendering tools, such as AutoCAD. Because GPUs lack the basic general-purpose instructions needed to run a computer on their own, they are paired with CPUs in the same system.
Yet the GPU is not simply a CPU with more instructions. Instead, it's a fundamentally different approach to solving specific computing problems. The limited number of functions performed by a GPU means each core is far smaller, but its highly parallel architecture allows thousands of cores to manage massive parallel computing tasks and high data throughput. Still, the GPU cannot multitask well, and it generally has limited memory access.
GPUs are well suited to many demanding ML models and are best employed when the model has the following properties:
- It includes a significant number of custom operations that must run at least partially on CPUs.
- It uses operations that are unavailable on a TPU.
- It is medium to large, with larger effective batch sizes.
The TPU is closer to a pure ASIC, providing a limited number of math functions, primarily matrix processing, expressly intended for ML tasks. A TPU is noted for the high throughput and parallelism normally associated with GPUs, taken to extremes in its design.
Typical TPU chips contain one or more TensorCores. Each employs matrix-multiply units (MXUs), a vector unit and a scalar unit. Every MXU incorporates a 128 x 128 array of multiply-accumulator ALUs, so each MXU performs 16,384 multiply-accumulate operations per clock cycle using floating point math.
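To see how those numbers compound, here is a back-of-the-envelope calculation in Python. The 128 x 128 array size comes from the text above; the clock rate and MXU count are hypothetical round numbers, not published specifications:

```python
# Back-of-the-envelope peak throughput for one TensorCore.
array_rows, array_cols = 128, 128
macs_per_cycle = array_rows * array_cols   # 16,384 multiply-accumulates per MXU
clock_hz = 1e9                             # assumed 1 GHz clock (illustrative)
mxus_per_core = 4                          # assumed MXU count (illustrative)

# Each multiply-accumulate counts as 2 floating point operations (1 multiply + 1 add).
flops = macs_per_cycle * 2 * clock_hz * mxus_per_core
print(f"~{flops / 1e12:.0f} TFLOPS peak per TensorCore (illustrative)")
```

Under these assumed figures, a single TensorCore would deliver on the order of 100-plus TFLOPS, which is why TPU pods scale into such large aggregate performance numbers.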
TPUs are primarily purpose-built chips ideally suited for ML models with the following properties:
- The model is dominated by matrix computations.
- The main training loop contains no custom operations.
- The model trains for weeks or months at a time.
- The model is large, with large effective batch sizes.
As with any type of processor, the TPU chip does nothing without software capable of employing its functions. TensorFlow software provides the framework that delivers data to the TPU and then returns results to the associated ML models. TPUs are used in a variety of tasks; the most popular include the following:
- Training and serving large neural networks.
- Image recognition and classification.
- Natural language processing, such as translation and large language models.
- Recommendation systems.
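As one concrete illustration of that software layer, the sketch below shows the common TensorFlow pattern for targeting a Cloud TPU. It assumes it runs in an environment with a TPU attached (for example, a Cloud TPU VM), and the tiny Keras model is only a placeholder:

```python
import tensorflow as tf

# Locate and initialize the attached TPU.
# "local" applies to a Cloud TPU VM; the argument varies by environment.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="local")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

# TPUStrategy replicates computation across the TPU's cores.
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    # Placeholder model: variables and compute get placed on the TPU.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
```

Any subsequent `model.fit()` call then dispatches the matrix-heavy training work to the TPU's matrix units rather than the host CPU.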
TPUs are proprietary ASIC devices developed by Google and used in GCP data centers since 2015. The chips support Google's TensorFlow symbolic math software platform and other ML tasks involving matrix mathematics. Google also produces TPUs for commercial use, making TPU-based services available through GCP.
Google TPUs have undergone five major iterations since their initial introduction. The latest v5 TPU is available as both a low-power economy model (v5e) and a full-performance model (v5p).
| Feature | TPUv1 | TPUv2 | TPUv3 | TPUv4 | TPUv5e (economy) | TPUv5p (performance) |
| --- | --- | --- | --- | --- | --- | --- |
| Year introduced | 2016 | 2017 | 2018 | 2021 | 2023 | 2023 |
| Performance (floating point) | 23 TFLOPS | 45 TFLOPS | 123 TFLOPS | 275 TFLOPS | 197 TFLOPS | 459 TFLOPS |
| Memory capacity | 8 GB | 16 GB | 32 GB | 32 GB | 16 GB | 95 GB |
| Memory bandwidth | 34 GBps | 600 GBps | 900 GBps | 1,200 GBps | 819 GBps | 2,765 GBps |
| Chips per pod | Unspecified | 256 | 1,024 | 4,096 | 256 | 8,960 |

Editor's note: This data is from Google.