
Hardware acceleration is the use of computer hardware, known as a hardware accelerator, to perform specific functions faster than can be done by software running on a general-purpose central processing unit (CPU). Any transformation of data that can be calculated by software running on a CPU can also be calculated by an appropriate hardware accelerator, or by a combination of both.
To perform computing tasks more efficiently, generally one can invest time and money in improving the software, improving the hardware, or both. There are various approaches with advantages and disadvantages in terms of decreased latency, increased throughput, and reduced energy consumption.
Typical advantages of focusing on software may include greater versatility, more rapid development, lower non-recurring engineering costs, heightened portability, and ease of updating features or patching bugs, at the cost of overhead to compute general operations.
Advantages of focusing on hardware may include speedup, reduced power consumption,[1] lower latency, increased parallelism[2] and bandwidth, and better utilization of area and functional components available on an integrated circuit; at the cost of lower ability to update designs once etched onto silicon and higher costs of functional verification, times to market, and the need for more parts.
In the hierarchy of digital computing systems ranging from general-purpose processors to fully customized hardware, there is a tradeoff between flexibility and efficiency, with efficiency increasing by orders of magnitude when any given application is implemented higher up that hierarchy.[3] This hierarchy includes general-purpose processors such as CPUs,[4] more specialized processors such as programmable shaders in a GPU,[5] applications implemented on field-programmable gate arrays (FPGAs),[6] and fixed-function logic implemented on application-specific integrated circuits (ASICs).[7]
Hardware acceleration is advantageous for performance, and practical when the functions are fixed, so updates are not as needed as in software solutions. With the advent of reprogrammable logic devices such as FPGAs, the restriction of hardware acceleration to fully fixed algorithms has eased since 2010, allowing hardware acceleration to be applied to problem domains requiring modification to algorithms and processing control flow.[8][9] A disadvantage, however, is that hardware acceleration often depends on proprietary libraries that not all vendors are keen to distribute or expose, which makes it difficult to integrate into open-source projects.
Integrated circuits are designed to handle various operations on both analog and digital signals. In computing, digital signals are the most common and are typically represented as binary numbers. Computer hardware and software use this binary representation to perform computations. This is done by processing Boolean functions on the binary input, and then outputting the results for storage or further processing by other devices.
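As a minimal sketch of this idea, the following C++ fragment computes a 1-bit full adder purely as Boolean functions of binary inputs; a hardware implementation realizes the same functions directly as logic gates. The function name and structure are illustrative, not drawn from any particular library.

    #include <cstdio>

    // A 1-bit full adder expressed as Boolean functions of binary inputs.
    // Hardware realizes the same functions directly as XOR, AND, and OR gates.
    void full_adder(int a, int b, int cin, int *sum, int *cout) {
        *sum  = a ^ b ^ cin;                // two XOR gates
        *cout = (a & b) | (cin & (a ^ b));  // AND and OR gates
    }

    int main() {
        int s, c;
        full_adder(1, 1, 0, &s, &c);
        printf("sum=%d carry=%d\n", s, c);  // 1 + 1 = binary 10: sum=0, carry=1
    }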
Because any computable function can be realized either in software or in hardware, it is always possible to design custom hardware that performs the same function as a given piece of software. Conversely, software can always be used to emulate the function of a given piece of hardware. Custom hardware may offer higher performance per watt for the same functions that can be specified in software. Hardware description languages (HDLs) such as Verilog and VHDL can model the same semantics as software and synthesize the design into a netlist that can be programmed to an FPGA or composed into the logic gates of an ASIC.
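To illustrate the converse direction, software emulating hardware, the sketch below evaluates a tiny gate-level netlist in ordinary C++. The data structures and names are hypothetical, chosen only for this example; real logic simulators are far more elaborate.

    #include <cstdio>
    #include <vector>

    // Each gate reads two earlier signals and produces one new signal.
    enum class Op { AND, OR, XOR };
    struct Gate { Op op; int in0, in1; };

    // Evaluates a netlist in order; signals 0..k-1 are the primary inputs.
    std::vector<int> evaluate(const std::vector<Gate>& netlist,
                              std::vector<int> signals) {
        for (const Gate& g : netlist) {
            int a = signals[g.in0], b = signals[g.in1];
            int out = (g.op == Op::AND) ? (a & b)
                    : (g.op == Op::OR)  ? (a | b)
                                        : (a ^ b);
            signals.push_back(out);
        }
        return signals;
    }

    int main() {
        // Netlist for a half adder on inputs {0, 1}: signal 2 = sum, 3 = carry.
        std::vector<Gate> halfAdder = { {Op::XOR, 0, 1}, {Op::AND, 0, 1} };
        std::vector<int> s = evaluate(halfAdder, {1, 1});
        printf("sum=%d carry=%d\n", s[2], s[3]);  // 1 + 1: sum=0, carry=1
    }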
The vast majority of software-based computing occurs on machines implementing the von Neumann architecture, collectively known as stored-program computers. Computer programs are stored as data and executed by processors. Such processors must fetch and decode instructions, as well as load data operands from memory (as part of the instruction cycle), to execute the instructions constituting the software program. Relying on a common cache for code and data leads to the "von Neumann bottleneck", a fundamental limitation on the throughput of software on processors implementing the von Neumann architecture. Even in the modified Harvard architecture, where instructions and data have separate caches in the memory hierarchy, there is overhead to decoding instruction opcodes and multiplexing available execution units on a microprocessor or microcontroller, leading to low circuit utilization. Modern processors that provide simultaneous multithreading exploit under-utilization of available processor functional units and instruction-level parallelism between different hardware threads.
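The per-instruction overhead is easy to see in a toy interpreter for a stored-program machine: every instruction, however trivial, pays for a fetch, a decode, and operand accesses before any useful work happens. The following sketch is illustrative and does not model any real instruction set.

    #include <cstdio>

    // Opcodes for a toy stored-program machine.
    enum { LOAD_IMM, ADD, HALT };

    // Every iteration pays the same fetch/decode/operand-access overhead,
    // regardless of how little work the instruction does; fixed-function
    // hardware avoids these stages entirely.
    int run(const int *program) {
        int acc = 0, pc = 0;
        for (;;) {
            int op = program[pc++];                     // fetch and decode
            switch (op) {
                case LOAD_IMM: acc = program[pc++]; break;   // load operand
                case ADD:      acc += program[pc++]; break;  // load and add
                case HALT:     return acc;
            }
        }
    }

    int main() {
        int program[] = { LOAD_IMM, 2, ADD, 3, HALT };
        printf("%d\n", run(program));  // prints 5
    }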
Hardware execution units do not in general rely on the von Neumann or modified Harvard architectures and do not need to perform the instruction fetch and decode steps of an instruction cycle and incur those stages' overhead. If needed calculations are specified in a register transfer level (RTL) hardware design, the time and circuit area costs that would be incurred by instruction fetch and decoding stages can be reclaimed and put to other uses.
This reclamation saves time, power, and circuit area in computation. The reclaimed resources can be used for increased parallel computation, other functions, communication, or memory, as well as increased input/output capabilities. This comes at the cost of general-purpose utility.
Greater RTL customization of hardware designs allows emerging architectures such as in-memory computing, transport triggered architectures (TTA) and networks-on-chip (NoC) to further benefit from increased locality of data to execution context, thereby reducing computing and communication latency between modules and functional units.
Custom hardware is limited in parallel processing capability only by the area and logic blocks available on the integrated circuit die.[10] Therefore, hardware is much freer to offer massive parallelism than software on general-purpose processors, offering a possibility of implementing the parallel random-access machine (PRAM) model.
It is common to build multicore and manycore processing units out of microprocessor IP core schematics on a single FPGA or ASIC.[11][12][13][14][15] Similarly, specialized functional units can be composed in parallel, as in digital signal processing, without being embedded in a processor IP core. Therefore, hardware acceleration is often employed for repetitive, fixed tasks involving little conditional branching, especially on large amounts of data. This is how Nvidia's CUDA line of GPUs is implemented.
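A minimal CUDA sketch of such a task is SAXPY (y = a·x + y), a repetitive, branch-free operation applied uniformly across a large array; the launch configuration below is illustrative rather than tuned for any particular GPU.

    #include <cuda_runtime.h>
    #include <cstdio>

    // y[i] = a*x[i] + y[i]: the same fixed operation applied to every element,
    // the kind of task that maps well onto many parallel execution units.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);  // one thread per element
        cudaDeviceSynchronize();
        printf("y[0] = %f\n", y[0]);  // expect 4.0
        cudaFree(x); cudaFree(y);
    }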
As device mobility has increased, new metrics have been developed that measure the relative performance of specific acceleration protocols, considering characteristics such as physical hardware dimensions, power consumption, and operations throughput. These can be summarized into three categories: task efficiency, implementation efficiency, and flexibility. Appropriate metrics consider the area of the hardware along with both the corresponding operations throughput and energy consumed.[16]
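As a rough illustration of how such metrics combine area, throughput, and energy, the snippet below computes two common implementation-efficiency figures from hypothetical measurements; the numbers are invented for the example and do not describe any real device.

    #include <cstdio>

    int main() {
        // Hypothetical measurements for some accelerator design:
        double ops_per_second = 2.0e12;  // operations throughput (ops/s)
        double power_watts    = 10.0;    // power consumption (W)
        double area_mm2       = 25.0;    // silicon area (mm^2)

        // Two common implementation-efficiency metrics:
        double energy_eff = ops_per_second / power_watts;  // ops per joule
        double area_eff   = ops_per_second / area_mm2;     // ops/s per mm^2
        printf("%.3g ops/J, %.3g ops/s/mm^2\n", energy_eff, area_eff);
    }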
Examples of hardware acceleration include bit blit acceleration functionality in graphics processing units (GPUs), use of memristors for accelerating neural networks, and regular expression hardware acceleration for spam control in the server industry, intended to prevent regular expression denial of service (ReDoS) attacks.[17] The hardware that performs the acceleration may be part of a general-purpose CPU or a separate unit called a hardware accelerator, though such units are usually referred to by a more specific term, such as 3D accelerator or cryptographic accelerator.
Traditionally, processors were sequential (instructions are executed one by one), and were designed to run general-purpose algorithms controlled by instruction fetch (for example, moving temporary results to and from a register file). Hardware accelerators improve the execution of a specific algorithm by allowing greater concurrency, having specific datapaths for their temporary variables, and reducing the overhead of instruction control in the fetch-decode-execute cycle.
Modern processors are multi-core and often feature parallel "single instruction, multiple data" (SIMD) units. Such units can be integrated within the CPU or provided by additional components, such as the AMD AI engines.[18] Even so, hardware acceleration still yields benefits. Hardware acceleration is suitable for any computation-intensive algorithm that is executed frequently in a task or program. Depending upon the granularity, hardware acceleration can vary from a small functional unit to a large functional block (like motion estimation in MPEG-2).
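As a sketch of what a SIMD unit offers, the following fragment uses x86 SSE intrinsics to add four pairs of floats per instruction; it assumes an x86 CPU with SSE support and, to keep the example short, an array length that is a multiple of four.

    #include <immintrin.h>  // x86 SSE intrinsics (host-side CPU code)
    #include <cstdio>

    // Adds two float arrays four lanes at a time using the CPU's SIMD unit.
    void add_simd(const float *a, const float *b, float *out, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);  // load 4 floats
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(out + i, _mm_add_ps(va, vb));  // 4 adds at once
        }
    }

    int main() {
        float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
        float out[8];
        add_simd(a, b, out, 8);
        printf("%f\n", out[0]);  // prints 9.0
    }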
For example, implementing digital filters on FPGAs has been shown to increase their performance over software implementations.