
A neural processing unit (NPU), also known as AI accelerator or deep learning processor, is a class of specialized hardware accelerator[1] or computer system[2][3] designed to accelerate artificial intelligence (AI) and machine learning applications, including artificial neural networks and computer vision.
Their purpose is either to efficiently execute already trained AI models (inference) or to train AI models. Their applications include algorithms for robotics, Internet of things, and data-intensive or sensor-driven tasks.[4] They are often manycore or spatial designs and focus on low-precision arithmetic, novel dataflow architectures, or in-memory computing capability. As of 2024, a widely used datacenter-grade AI integrated circuit chip, the Nvidia H100 GPU, contains tens of billions of MOSFETs.[5]
AI accelerators are used in mobile devices such as Apple iPhones, Huawei devices, and Google Pixel smartphones,[7] in AMD AI Engines[6] in Versal devices and NPUs, and in many Apple silicon, Qualcomm, Samsung, and Google Tensor smartphone processors.[8]

NPUs have more recently (circa 2022) been added to computer processors from Intel,[9] AMD,[10] and Apple silicon.[11] All models of Intel Meteor Lake processors have a built-in versatile processor unit (VPU) for accelerating inference for computer vision and deep learning.[12]
On consumer devices, the NPU is intended to be small and power-efficient, yet reasonably fast when running small models. To achieve this, NPUs are designed to support low-bitwidth operations using data types such as INT4, INT8, FP8, and FP16. A common performance metric is trillions of operations per second (TOPS), though this metric alone does not specify which kind of operations are being counted.[13]
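The sketch below illustrates what such low-bitwidth arithmetic looks like: a symmetric INT8 quantization of a small matrix-vector product in NumPy. It is only a minimal illustration under assumed shapes and scale choices; real NPU toolchains select scales per tensor or per channel during model conversion, and the integer matrix product itself would run on the NPU's INT8 units.

```python
import numpy as np

def quantize_int8(x):
    """Map a float tensor onto int8 with a single symmetric scale (illustrative)."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
weights = rng.standard_normal((256, 256)).astype(np.float32)
activations = rng.standard_normal(256).astype(np.float32)

qw, sw = quantize_int8(weights)
qa, sa = quantize_int8(activations)

# The matrix product runs entirely in integer arithmetic with int32
# accumulation, which is what INT8 matrix units provide; the result is
# rescaled back to floating point afterwards.
int_result = qw.astype(np.int32) @ qa.astype(np.int32)
approx = int_result * (sw * sa)

# Difference from the float32 reference shows the quantization error.
print(np.max(np.abs(approx - weights @ activations)))
```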
Accelerators are used in cloud computing servers, including tensor processing units (TPU) in Google Cloud Platform[14] and Trainium and Inferentia chips in Amazon Web Services.[15] Many vendor-specific terms exist for devices in this category, and it is an emerging technology without a dominant design.
Since the late 2010s, graphics processing units designed by companies such as Nvidia and AMD often include AI-specific hardware in the form of dedicated functional units for low-precision matrix-multiplication operations. These GPUs are commonly used as AI accelerators, both for training and inference.[16]

Although NPUs are tailored for low-precision (e.g. FP16, INT8) matrix multiplication operations, they can be used to emulate higher-precision matrix multiplications in scientific computing. Because modern GPUs devote much of their design effort to making these low-precision units fast, FP64 emulated via the Ozaki scheme can potentially outperform native FP64: this has been demonstrated using FP16-emulated FP64 on the Nvidia TITAN RTX and using INT8-emulated FP64 on Nvidia consumer GPUs and the A100 GPU. Consumer GPUs benefit especially from this scheme, as they have little native FP64 hardware, showing a 6× speedup.[17] Since CUDA Toolkit 13.0 Update 2, cuBLAS automatically uses INT8-emulated FP64 matrix multiplication at equivalent precision when it is faster than the native path, in addition to the FP16-emulated FP32 feature introduced in version 12.9.[18]
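The following NumPy sketch illustrates the splitting idea behind such emulation. It is a simplified stand-in, not the cuBLAS or published Ozaki-scheme implementation: float32 products play the role of the low-precision matrix units, and the slice count and bit budget are chosen only so that the partial products stay essentially error-free for this small example.

```python
import numpy as np

def split_slices(A, num_slices=7, bits=8):
    """Split a float64 matrix into slices of at most ~`bits` significant bits
    each, so that slice-by-slice products can be formed by a low-precision
    matrix engine (imitated here with float32) with little or no rounding."""
    slices = []
    rest = np.array(A, dtype=np.float64, copy=True)
    for _ in range(num_slices):
        _, e = np.frexp(rest)                # rest = m * 2**e with 0.5 <= |m| < 1
        scale = np.ldexp(1.0, e - bits)      # 2**(e - bits)
        hi = np.round(rest / scale) * scale  # keep the top `bits` bits
        slices.append(hi)
        rest = rest - hi                     # remainder carries the lower bits
    return slices

def emulated_fp64_matmul(A, B, num_slices=7, bits=8):
    """Emulate an FP64 matmul from many low-precision matmuls (the idea behind
    the Ozaki scheme); partial products are accumulated in float64."""
    As = split_slices(A, num_slices, bits)
    Bs = split_slices(B, num_slices, bits)
    C = np.zeros((A.shape[0], B.shape[1]), dtype=np.float64)
    for Ai in As:
        for Bj in Bs:
            # A real implementation would issue this product to FP16 or INT8
            # matrix units; float32 stands in for them in this sketch.
            C += Ai.astype(np.float32) @ Bj.astype(np.float32)
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64))
B = rng.standard_normal((64, 64))
# The maximum difference from a native float64 matmul should be near
# float64 roundoff for this problem size.
print(np.max(np.abs(emulated_fp64_matmul(A, B) - A @ B)))
```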
Mobile NPU vendors typically provide their own application programming interface, such as the Snapdragon Neural Processing Engine. An operating system or a higher-level library may provide a more generic interface, such as TensorFlow Lite with LiteRT Next (Android) or Core ML (iOS, macOS).
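As a hedged illustration of the higher-level route, the snippet below runs a model through the TensorFlow Lite Python interpreter. The model file name is hypothetical, and the example executes on the CPU; on an Android device the runtime would normally be asked to dispatch supported operations to the NPU through a hardware delegate.

```python
import numpy as np
import tensorflow as tf  # provides the TensorFlow Lite interpreter

# Hypothetical model file; any .tflite model exported from a framework works.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a dummy input matching the model's expected shape and dtype.
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()

print(interpreter.get_tensor(output_details[0]["index"]))
```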
Consumer CPU-integrated NPUs are accessible through vendor-specific APIs: AMD (Ryzen AI), Intel (OpenVINO), and Apple silicon (Core ML)[a] each have their own API, which higher-level libraries can build upon.
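For instance, a minimal sketch of inference through Intel's OpenVINO runtime might look like the following; the model files and the availability of the "NPU" device are assumptions that depend on the installed driver and plugin, and a statically shaped model is assumed.

```python
import numpy as np
import openvino as ov

core = ov.Core()
# "NPU" targets the integrated accelerator when its driver and plugin are
# present; "CPU" or "GPU" select other devices through the same API.
compiled = core.compile_model("model.xml", device_name="NPU")

request = compiled.create_infer_request()
input_port = compiled.input(0)

# Dummy input matching the (assumed static) input shape of the model.
dummy = np.zeros(list(input_port.shape), dtype=np.float32)

result = request.infer({input_port: dummy})
print(result[compiled.output(0)])
```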
GPUs generally use existing GPGPU pipelines such as CUDA and OpenCL adapted for lower precisions. Custom-built systems such as the Google TPU use private interfaces.
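As a sketch of the GPGPU route, the PyTorch snippet below performs a half-precision matrix multiplication through the CUDA backend, which on recent Nvidia GPUs is dispatched to the dedicated matrix (tensor core) units; it assumes a CUDA-capable GPU and a CUDA-enabled PyTorch build.

```python
import torch

assert torch.cuda.is_available(), "requires a CUDA-capable GPU"

# FP16 operands: on recent Nvidia GPUs this matmul runs on the dedicated
# low-precision matrix units rather than the regular FP32 ALUs.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

c = a @ b
torch.cuda.synchronize()  # wait for the asynchronous GPU kernel to finish
print(c.dtype, c.shape)
```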