
A neural processing unit (NPU), also known as an AI accelerator or deep learning processor, is a class of specialized hardware accelerator[1] or computer system[2][3] designed to accelerate artificial intelligence (AI) and machine learning applications, including artificial neural networks and computer vision.
Their purpose is either to efficiently execute already trained AI models (inference) or to train AI models. Their applications include algorithms for robotics, Internet of things, and data-intensive or sensor-driven tasks.[4] They are often manycore or spatial designs and focus on low-precision arithmetic, novel dataflow architectures, or in-memory computing capability. As of 2024, a widely used datacenter-grade AI integrated circuit chip, the Nvidia H100 GPU, contains tens of billions of MOSFETs.[5]
AI accelerators are used in mobile devices such as Apple iPhones, Huawei devices, and Google Pixel smartphones,[7] in AMD AI engines[6] in Versal devices and NPUs, and in many Apple silicon, Qualcomm, Samsung, and Google Tensor smartphone processors.[8]
More recently (circa 2022), NPUs have been added to computer processors from Intel,[9] AMD,[10] and Apple silicon.[11] All models of Intel Meteor Lake processors have a built-in versatile processor unit (VPU) for accelerating inference for computer vision and deep learning.[12]
On consumer devices, the NPU is intended to be small and power-efficient, yet reasonably fast when running small models. To this end, NPUs are designed to support low-bitwidth operations using data types such as INT4, INT8, FP8, and FP16. A common performance metric is trillions of operations per second (TOPS). Although TOPS does not explicitly specify the kind of operation, it typically refers to INT8 additions and multiplications.[13]
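As an illustration, a peak TOPS figure can be estimated from the number of multiply-accumulate (MAC) units and the clock frequency, counting each MAC as two operations (one multiplication and one addition). The unit count and clock rate in the sketch below are hypothetical, not taken from any particular product.

```python
# Hypothetical NPU: 4,096 INT8 MAC units clocked at 1.5 GHz.
mac_units = 4096
clock_hz = 1.5e9
ops_per_mac = 2  # one multiply plus one add per MAC

peak_ops_per_second = mac_units * clock_hz * ops_per_mac
print(f"Peak throughput: {peak_ops_per_second / 1e12:.1f} TOPS")  # about 12.3 TOPS
```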

Accelerators are used in cloud computing servers: for example, tensor processing units (TPUs) for Google Cloud Platform,[14] and Trainium and Inferentia chips for Amazon Web Services.[15] Many vendor-specific terms exist for devices in this category, and it is an emerging technology without a dominant design.
Since the late 2010s, graphics processing units designed by companies such as Nvidia and AMD often include AI-specific hardware in the form of dedicated functional units for low-precision matrix-multiplication operations. These GPUs are commonly used as AI accelerators, both for training and inference.[16]
Although NPUs are tailored for low-precision (e.g. FP16, INT8) matrix multiplication operations, they can be used to emulate higher-precision matrix multiplications in scientific computing. Because modern GPUs devote much of their design effort to making these low-precision matrix units fast, emulated FP64 (using the Ozaki scheme) on such units can outperform native FP64: this has been demonstrated with FP16-emulated FP64 on the Nvidia Titan RTX and with INT8-emulated FP64 on Nvidia consumer GPUs and the A100 GPU. (Consumer GPUs benefit especially from this scheme, as they have little native FP64 hardware, showing a 6× speedup.)[17] Since CUDA Toolkit 13.0 Update 2, cuBLAS automatically uses INT8-emulated FP64 matrix multiplication of equivalent precision when it is faster than the native path, in addition to the FP16-emulated FP32 feature introduced in version 12.9.[18]
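The NumPy sketch below illustrates only the general slice-and-accumulate idea behind such emulation: a double-precision matrix is split into a sum of FP16-representable slices, and the partial products of the slices are summed. It is a simplified toy, not the Ozaki scheme as implemented in cuBLAS, which additionally scales the slices so that every partial product is exact on the low-precision hardware.

```python
import numpy as np

def fp16_slices(matrix, num_slices=3):
    """Split a float64 matrix into a sum of float16-representable slices."""
    slices, residual = [], matrix.copy()
    for _ in range(num_slices):
        s = residual.astype(np.float16).astype(np.float64)  # keep ~11 mantissa bits
        slices.append(s)
        residual -= s  # carry the rounding error into the next slice
    return slices

def emulated_matmul(a, b, num_slices=3):
    """Approximate a @ b by accumulating products of low-precision slices."""
    c = np.zeros((a.shape[0], b.shape[1]))
    for a_i in fp16_slices(a, num_slices):
        for b_j in fp16_slices(b, num_slices):
            # On real hardware each partial product would run on the GPU's
            # low-precision matrix units; NumPy computes it in float64 here.
            c += a_i @ b_j
    return c

rng = np.random.default_rng(0)
a, b = rng.standard_normal((256, 256)), rng.standard_normal((256, 256))
print(np.max(np.abs(emulated_matmul(a, b) - a @ b)))  # small residual error
```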
An operating system or a higher-level library may provide application programming interfaces such as TensorFlow Lite with LiteRT Next (Android) or CoreML (iOS, macOS). Formats such as ONNX are used to represent trained neural networks.
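For example, a trained network exported to ONNX can be executed with the cross-platform ONNX Runtime library, which dispatches the computation to whatever backend is available. The model path, input name lookup, and input shape below are hypothetical placeholders.

```python
import numpy as np
import onnxruntime as ort

# "model.onnx" is a hypothetical placeholder for an exported network.
session = ort.InferenceSession("model.onnx")

# Feed a dummy tensor shaped like the model's expected input.
input_name = session.get_inputs()[0].name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)

outputs = session.run(None, {input_name: x})
print(outputs[0].shape)
```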
Consumer CPU-integrated NPUs are accessible through vendor-specific APIs. AMD (Ryzen AI), Intel (OpenVINO), Apple silicon (CoreML),[a] and Qualcomm (SNPE) each have their own APIs, which higher-level libraries can build upon.
GPUs generally use existing GPGPU pipelines such as CUDA and OpenCL, adapted for lower precisions and specialized matrix-multiplication operations. Vulkan is also being used. Custom-built systems such as the Google TPU use private interfaces.
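As one concrete sketch of such a pipeline, the snippet below uses PyTorch on top of CUDA (one common GPGPU stack, assumed to be installed with a CUDA-capable GPU) and casts a matrix product to FP16 so it can be dispatched to the GPU's low-precision matrix units; the matrix sizes are arbitrary.

```python
import torch

# Hypothetical 1024x1024 GEMM; any matrix-multiply workload would do.
a = torch.randn(1024, 1024)
b = torch.randn(1024, 1024)

if torch.cuda.is_available():
    # Casting to float16 lets the driver route the multiplication to the
    # GPU's dedicated low-precision matrix units (e.g. tensor cores).
    a, b = a.cuda().half(), b.cuda().half()

c = torch.matmul(a, b)
print(c.dtype, c.device)
```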
A large number of separate underlying acceleration APIs, compilers, and runtimes are in use in the AI field, which greatly increases software development effort because of the many combinations involved. As of 2025, the open standards organization Khronos Group is pursuing standardization of AI-related interfaces to reduce the amount of duplicated work. Khronos is working on three separate fronts: expanding the data types and intrinsic operations in OpenCL and Vulkan, adding compute graphs to SPIR-V, and an NNEF/SkriptND file format for describing a neural network.[19]