General-purpose computing on graphics processing units (GPGPU, or less often GPGP) is the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU).[1][2][3][4] The use of multiple video cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing.[5]
Essentially, a GPGPU pipeline is a kind of parallel processing between one or more GPUs and CPUs, with special accelerated instructions for processing images or other graphical forms of data. While GPUs operate at lower frequencies, they typically have many times the number of processing elements. Thus, GPUs can process far more pictures and other graphical data per second than a traditional CPU. Migrating data into parallel form and then using the GPU to process it can (theoretically) create a large speedup.
GPGPU pipelines were developed at the beginning of the 21st century for graphics processing (e.g., for better shaders). From the history of supercomputing it is well known that scientific computing drives the largest concentrations of computing power in history, as listed in the TOP500: the majority of those systems today utilize GPUs.
The best-known GPGPUs are Nvidia Tesla GPUs, which are used in Nvidia DGX systems, alongside AMD Instinct and Intel Gaudi.
In principle, any arbitrary Boolean function, including addition, multiplication, and other mathematical functions, can be built up from a functionally complete set of logic operators. In 1987, Conway's Game of Life became one of the first examples of general-purpose computing using an early stream processor called a blitter to invoke a special sequence of logical operations on bit vectors.[6]
General-purpose computing on GPUs became more practical and popular after about 2001, with the advent of both programmable shaders and floating point support on graphics processors. Notably, problems involving matrices and/or vectors – especially two-, three-, or four-dimensional vectors – were easy to translate to a GPU, which acts with native speed and support on those types. A significant milestone for GPGPU was the year 2003, when two research groups independently discovered GPU-based approaches for the solution of general linear algebra problems on GPUs that ran faster than on CPUs.[7][8] These early efforts to use GPUs as general-purpose processors required reformulating computational problems in terms of graphics primitives, as supported by the two major APIs for graphics processors, OpenGL and DirectX. This cumbersome translation was obviated by the advent of general-purpose programming languages and APIs such as Sh/RapidMind, Brook and Accelerator.[9][10][11]
These were followed by Nvidia's CUDA, which allowed programmers to ignore the underlying graphical concepts in favor of more common high-performance computing concepts.[12] Newer, hardware-vendor-independent offerings include Microsoft's DirectCompute and Apple/Khronos Group's OpenCL.[12] This means that modern GPGPU pipelines can leverage the speed of a GPU without requiring full and explicit conversion of the data to a graphical form.
Mark Harris, the founder of GPGPU.org, claims he coined the term GPGPU.[13]
Any language that allows the code running on the CPU to poll a GPU shader for return values can create a GPGPU framework. Programming standards for parallel computing include OpenCL (vendor-independent), OpenACC, OpenMP and OpenHMPP.
As of 2016[update], OpenCL is the dominant open general-purpose GPU computing language, and is an open standard defined by the Khronos Group.[citation needed] OpenCL provides a cross-platform GPGPU platform that additionally supports data parallel compute on CPUs. OpenCL is actively supported on Intel, AMD, Nvidia, and ARM platforms. The Khronos Group has also standardised and implemented SYCL, a higher-level programming model for OpenCL as a single-source domain specific embedded language based on pure C++11.
The dominant proprietary framework is Nvidia CUDA.[14] Nvidia launched CUDA in 2006, a software development kit (SDK) and application programming interface (API) that allows using the programming language C to code algorithms for execution on GeForce 8 series and later GPUs.
ROCm, launched in 2016, is AMD's open-source response to CUDA. As of 2022, it is on par with CUDA with regard to features,[citation needed] but still lags in consumer support.[citation needed]
OpenVIDIA was developed at the University of Toronto between 2003 and 2005,[15] in collaboration with Nvidia.
Altimesh Hybridizer, created by Altimesh, compiles Common Intermediate Language to CUDA binaries.[16][17] It supports generics and virtual functions.[18] Debugging and profiling are integrated with Visual Studio and Nsight.[19] It is available as a Visual Studio extension on the Visual Studio Marketplace.
Microsoft introduced the DirectCompute GPU computing API, released with the DirectX 11 API.
Alea GPU,[20] created by QuantAlea,[21] introduces native GPU computing capabilities for the Microsoft .NET languages F#[22] and C#. Alea GPU also provides a simplified GPU programming model based on GPU parallel-for and parallel aggregate using delegates and automatic memory management.[23]
MATLAB supports GPGPU acceleration using the Parallel Computing Toolbox and MATLAB Distributed Computing Server,[24] and third-party packages like Jacket.
GPGPU processing is also used to simulate Newtonian physics by physics engines,[25] and commercial implementations include Havok Physics, FX and PhysX, which are typically used for computer and video games.
C++ Accelerated Massive Parallelism (C++ AMP) is a library that accelerates execution of C++ code by exploiting the data-parallel hardware on GPUs.
Due to the increasing power of mobile GPUs, general-purpose programming has also become available on mobile devices running major mobile operating systems.
Google Android 4.2 enabled running RenderScript code on the mobile device's GPU.[26] RenderScript has since been deprecated in favour of first OpenGL compute shaders[27] and later Vulkan Compute.[28] OpenCL is available on many Android devices, but is not officially supported by Android.[29] Apple introduced the proprietary Metal API for iOS applications, able to execute arbitrary code through Apple's GPU compute shaders.[citation needed]
Originally, data was simply passed one way from a central processing unit (CPU) to a graphics processing unit (GPU), then to a display device. As time progressed, however, it became valuable for GPUs to store at first simple, then complex structures of data to be passed back to the CPU that analyzed an image, or a set of scientific data represented as a 2D or 3D format that a video card can understand. Because the GPU has access to every draw operation, it can analyze data in these forms quickly, whereas a CPU must poll every pixel or data element much more slowly, as the speed of access between a CPU and its larger pool of random-access memory (or, in an even worse case, a hard drive) is slower than that of GPUs and video cards, which typically contain smaller amounts of more expensive memory that is much faster to access. Transferring the portion of the data set to be actively analyzed to GPU memory in the form of textures or other easily readable GPU forms results in a speed increase. The distinguishing feature of a GPGPU design is the ability to transfer information bidirectionally back from the GPU to the CPU; generally, the data throughput in both directions is ideally high, resulting in a multiplier effect on the speed of a specific high-use algorithm.
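As a minimal, hedged sketch of this bidirectional flow using the CUDA runtime API (buffer names and sizes here are arbitrary), a program copies a data set to GPU memory, processes it there, and reads the result back:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const size_t n = 1 << 20;                 // one million floats (arbitrary size)
    const size_t bytes = n * sizeof(float);

    float *host_in  = (float *)malloc(bytes);
    float *host_out = (float *)malloc(bytes);
    for (size_t i = 0; i < n; i++) host_in[i] = (float)i;

    float *dev_buf;
    cudaMalloc(&dev_buf, bytes);                                   // allocate GPU memory
    cudaMemcpy(dev_buf, host_in, bytes, cudaMemcpyHostToDevice);   // CPU -> GPU

    // ... a kernel would process dev_buf here ...

    cudaMemcpy(host_out, dev_buf, bytes, cudaMemcpyDeviceToHost);  // GPU -> CPU
    printf("first element after round trip: %f\n", host_out[0]);

    cudaFree(dev_buf);
    free(host_in);
    free(host_out);
    return 0;
}
```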
GPGPU pipelines may improve efficiency on especially large data sets and/or data containing 2D or 3D imagery. They are used in complex graphics pipelines as well as scientific computing; more so in fields with large data sets like genome mapping, or where two- or three-dimensional analysis is useful – especially at present biomolecule analysis, protein study, and other complex organic chemistry. An example of such applications is the NVIDIA software suite for genome analysis.
Such pipelines can also vastly improve efficiency in image processing and computer vision, among other fields, as well as parallel processing generally. Some very heavily optimized pipelines have yielded speed increases of several hundred times the original CPU-based pipeline on one high-use task.
A simple example would be a GPU program that collects data about average lighting values as it renders some view from either a camera or a computer graphics program back to the main program on the CPU, so that the CPU can then make adjustments to the overall screen view. A more advanced example might use edge detection to return both numerical information and a processed image representing outlines to a computer vision program controlling, say, a mobile robot. Because the GPU has fast and local hardware access to every pixel or other picture element in an image, it can analyze and average it (for the first example) or apply a Sobel edge filter or other convolution filter (for the second) with much greater speed than a CPU, which typically must access slower random-access memory copies of the graphic in question.
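A hedged sketch of the first example in CUDA (the RGBA image layout and the Rec. 601 luminance weights are assumptions for illustration): each thread accumulates one pixel's luminance, and the host divides the total by the pixel count to obtain the average.

```cuda
// Each thread reads one RGBA pixel and accumulates its luminance into *sum.
// atomicAdd on float requires compute capability 2.0 or later.
__global__ void accumulate_luminance(const unsigned char *rgba,
                                     int width, int height,
                                     float *sum)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    const unsigned char *p = &rgba[4 * (y * width + x)];
    // Rec. 601 luma weights (an assumption in this sketch).
    float luma = 0.299f * p[0] + 0.587f * p[1] + 0.114f * p[2];
    atomicAdd(sum, luma);   // host later divides *sum by width * height
}
```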
GPGPU as a software concept is a type of algorithm, not a piece of equipment. Specialized equipment designs may, however, further enhance the efficiency of GPGPU pipelines, which traditionally perform relatively few algorithms on very large amounts of data. Massively parallelized, gigantic-data-level tasks may thus be parallelized even further via specialized setups such as rack computing (many similar, highly tailored machines built into a rack), which adds a third layer – many computing units each using many CPUs to correspond to many GPUs. Some Bitcoin "miners" used such setups for high-quantity processing. Insights into the largest such systems in the world have been maintained in the TOP500 supercomputer list.
Historically, CPUs have used hardware-managed caches, but the earlier GPUs only provided software-managed local memories. However, as GPUs are being increasingly used for general-purpose applications, state-of-the-art GPUs are being designed with hardware-managed multi-level caches which have helped the GPUs to move towards mainstream computing. For example, GeForce 200 series GT200 architecture GPUs did not feature an L2 cache, the Fermi GPU has 768 KiB last-level cache, the Kepler GPU has 1.5 MiB last-level cache,[30] the Maxwell GPU has 2 MiB last-level cache, and the Pascal GPU has 4 MiB last-level cache.
GPUs have very large register files, which allow them to reduce context-switching latency. Register file size is also increasing over different GPU generations, e.g., the total register file size on Maxwell (GM200), Pascal and Volta GPUs is 6 MiB, 14 MiB and 20 MiB, respectively.[31][32] By comparison, the size of a register file on CPUs is small, typically tens or hundreds of kilobytes.
In essence, almost all GPU workloads are inherently massively parallel LOAD-COMPUTE-STORE in nature, such as tiled rendering. Even storing one temporary vector for further recall (LOAD-COMPUTE-STORE-COMPUTE-LOAD-COMPUTE-STORE) is so expensive, due to the memory wall problem, that it is to be avoided at all costs.[33] The result is that register file size has to increase. In standard CPUs it is possible to introduce caches (a D-cache) to solve this problem; however, such caches are relatively large, making them impractical to introduce in GPUs, which would need one per processing element. ILLIAC IV innovatively solved the problem around 1967 by introducing a local memory per processing element (a PEM), a strategy later copied by the Aspex ASP.
GPGPUs differ greatly from each other in how many execution resources are assigned to each group of "cores" that performs a stream of operations, variously called a "streaming multiprocessor" (SM) by Nvidia, a compute unit (CU) or workgroup processor (WGP) by AMD depending on the microarchitecture, or an "Xe Core" by Intel, all designed to execute what OpenCL calls a "work-group".[34] Much like how CPUs may elect to implement wider vector instructions in smaller pieces (e.g., AMD Bulldozer supported the 256-bit AVX instructions by splitting them into two 128-bit operations) to save power and/or chip area,[35] GPU designers also vary the number of execution units to fit their expected workloads.
On a GPGPU, each of the following resources can vary freely in ratio to the others: FP64 (FMA), FP32 (FMA), FP16 (FMA), Int32 add, Int32 multiply, and RCP/RSQRT. (An example can be seen in Nvidia's documentation of the execution resources found in each SM of different generations (compute capabilities) of GPUs. Non-matrix FP16 is handled by the FP32 cores.[36]) GPGPUs intended for scientific computing often invest more heavily in FP64, while those designed for deep learning tend to invest more in FP16, lower-bitwidth "packed" integer operations, and additional dedicated matrix-multiplication units ("matrix units", "tensor cores").[37][38]
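The generation a device belongs to, and hence its per-SM resource mix, can be queried at run time. A hedged sketch using the CUDA runtime API (only a few of the reported fields are shown):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    // The compute capability identifies the SM generation, which determines
    // the per-SM mix of FP64/FP32/FP16/Int32 units in Nvidia's documentation.
    printf("device:             %s\n", prop.name);
    printf("compute capability: %d.%d\n", prop.major, prop.minor);
    printf("multiprocessors:    %d\n", prop.multiProcessorCount);
    return 0;
}
```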
It is therefore insufficient to qualify a GPU's computational capabilities simply in terms of FLOPS: FLOPS values should be presented separately for matrix and non-matrix modes, and (T)OPS figures should also be presented for integer operations.
The high performance of GPUs comes at the cost of high power consumption, which under full load can be as much as that of the rest of the PC system combined.[39] The maximum power consumption of the Pascal series GPU (Tesla P100) was specified to be 250 W.[40]
In terms of raw computing power (FLOPS, TOPS, etc.), GPUs tend to have more performance-per-watt than a typical CPU. However, it takes a well-written program and a fitting workload to extract most of this power, as most of the time (and power) would otherwise be wasted on local and host memory access.
Before CUDA was published in 2007, GPGPU was "classical" and involved repurposing graphics primitives. A standard structure of such a computation was to store arrays as textures, express the kernel as a pixel/fragment shader, drive the computation by drawing geometry (typically a screen-sized quad), and read the results back from the framebuffer or a render target.
More examples are available in part 4 of GPU Gems 2.[41]
The use of GPUs for numerical linear algebra began at least in 2001.[42] They have been used for Gauss–Seidel solvers, conjugate gradients, etc.[43]
Computer video cards are produced by various vendors, such as Nvidia and AMD. Cards from such vendors differ in implementing data-format support, such as integer and floating-point formats (32-bit and 64-bit). Microsoft introduced a Shader Model standard to help rank the various features of graphics cards into a simple Shader Model version number (1.0, 2.0, 3.0, etc.).
Pre-DirectX 9 video cards only supported paletted or integer color types. Sometimes another alpha value is added, to be used for transparency. Common formats are:
For early fixed-function or limited-programmability graphics (i.e., up to and including DirectX 8.1-compliant GPUs) this was sufficient because this is also the representation used in displays. This representation does have certain limitations. Given sufficient graphics processing power, even graphics programmers would like to use better formats, such as floating point data formats, to obtain effects such as high-dynamic-range imaging. Many GPGPU applications require floating point accuracy, which came with video cards conforming to the DirectX 9 specification.
DirectX 9 Shader Model 2.x suggested the support of two precision types: full and partial precision. Full precision support could either be FP32 or FP24 (floating point 32- or 24-bit per component) or greater, while partial precision was FP16. ATI's Radeon R300 series of GPUs supported FP24 precision only in the programmable fragment pipeline (although FP32 was supported in the vertex processors) while Nvidia's NV30 series supported both FP16 and FP32; other vendors such as S3 Graphics and XGI supported a mixture of formats up to FP24.
The implementations of floating point on Nvidia GPUs are mostly IEEE compliant; however, this is not true across all vendors.[44] This has implications for correctness which are considered important to some scientific applications. While 64-bit floating point values (double precision float) are commonly available on CPUs, these are not universally supported on GPUs. Some GPU architectures sacrifice IEEE compliance, while others lack double-precision support. Efforts have occurred to emulate double-precision floating point values on GPUs; however, the speed tradeoff negates any benefit of offloading the computation onto the GPU in the first place.[45]
Most operations on the GPU operate in a vectorized fashion: one operation can be performed on up to four values at once.[disputed –discuss] For example, if one color ⟨R1, G1, B1⟩ is to be modulated by another color ⟨R2, G2, B2⟩, the GPU can produce the resulting color ⟨R1*R2, G1*G2, B1*B2⟩ in one operation. This functionality is useful in graphics because almost every basic data type is a vector (either 2-, 3-, or 4-dimensional).[citation needed] Examples include vertices, colors, normal vectors, and texture coordinates.
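A hedged sketch of such a per-element vector operation in CUDA (float4 is CUDA's built-in four-component type; the kernel and buffer names are illustrative):

```cuda
// Modulate each pixel's color by another color, all four components per thread.
__global__ void modulate_colors(const float4 *a, const float4 *b,
                                float4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 c1 = a[i];
        float4 c2 = b[i];
        out[i] = make_float4(c1.x * c2.x,   // R1 * R2
                             c1.y * c2.y,   // G1 * G2
                             c1.z * c2.z,   // B1 * B2
                             c1.w * c2.w);  // A1 * A2
    }
}
```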
GPUs were originally designed specifically for graphics and thus are very restrictive in operations and programming. Due to their design, GPUs are only effective for problems that can be solved using stream processing, and the hardware can only be used in certain ways.
In the early GPGPU age, GPUs could only process independent vertices and fragments, but could process many of them in parallel. This was especially effective when the programmer wanted to process many vertices or fragments in the same way. In this sense, GPUs are stream processors – processors that can operate in parallel by running one kernel on many records in a stream at once. Programmers would use graphics APIs (OpenGL or DirectX) to perform general-purpose computation.
With the introduction of the CUDA (Nvidia, 2007) and OpenCL (vendor-independent, 2008) general-purpose computing APIs, new GPGPU codes no longer need to map the computation to graphics primitives. The stream processing nature of GPUs remains valid regardless of the APIs used. (See e.g.,[46])
A stream is simply a set of records that require similar computation. Streams provide data parallelism. Kernels are the functions that are applied to each element in the stream. In the GPUs, vertices and fragments are the elements in streams and vertex and fragment shaders are the kernels to be run on them.[dubious –discuss] For each element we can only read from the input, perform operations on it, and write to the output. It is permissible to have multiple inputs and multiple outputs, but never a piece of memory that is both readable and writable.[vague]
Arithmetic intensity is defined as the number of operations performed per word of memory transferred. It is important for GPGPU applications to have high arithmetic intensity, or else the memory access latency will limit computational speedup.[47]
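As a hedged worked example: a SAXPY-style update y[i] = a*x[i] + y[i] performs two floating-point operations per element (one multiply, one add) while transferring three words per element (read x[i], read y[i], write y[i]), so

$$\text{arithmetic intensity} \approx \frac{2 \text{ operations}}{3 \text{ words}} \approx 0.67 \text{ operations per word},$$

which is low enough that such a kernel is typically limited by memory bandwidth rather than arithmetic throughput.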
Ideal GPGPU applications have large data sets, high parallelism, and minimal dependency between data elements.
There are a variety of computational resources available on the GPU:
In fact, a program can substitute a write-only texture for output instead of the framebuffer. This is done either through Render to Texture (RTT), Render-To-Backbuffer-Copy-To-Texture (RTBCTT), or the more recent stream-out.
The most common form for a stream to take in GPGPU is a 2D grid because this fits naturally with the rendering model built into GPUs. Many computations naturally map into grids: matrix algebra, image processing, physically based simulation, and so on.
Since textures are used as memory, texture lookups are then used as memory reads. Certain operations can be done automatically by the GPU because of this.
Compute kernels can be thought of as the body of loops. For example, a programmer operating on a grid on the CPU might have code that looks like this:
```c
// Input and output grids have 10000 x 10000 or 100 million elements.
void transform_10k_by_10k_grid(float in[10000][10000], float out[10000][10000])
{
    for (int x = 0; x < 10000; x++) {
        for (int y = 0; y < 10000; y++) {
            // The next line is executed 100 million times
            out[x][y] = do_some_hard_work(in[x][y]);
        }
    }
}
```
On the GPU, the programmer only specifies the body of the loop as the kernel and what data to loop over by invoking geometry processing.
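In a modern compute API the same computation might look like the following hedged CUDA sketch (the 2D launch configuration and the placeholder do_some_hard_work device function are assumptions carried over from the CPU example above):

```cuda
// A __device__ version of the per-element work from the CPU example
// (placeholder body; the real computation is whatever the loop body did).
__device__ float do_some_hard_work(float v) { return v * v; }

// Each thread handles exactly one grid element; the explicit loops disappear.
__global__ void transform_grid_kernel(const float *in, float *out,
                                      int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        int idx = y * width + x;
        out[idx] = do_some_hard_work(in[idx]);
    }
}

// Host-side launch covering the 10000 x 10000 grid:
//   dim3 block(16, 16);
//   dim3 grid((10000 + block.x - 1) / block.x, (10000 + block.y - 1) / block.y);
//   transform_grid_kernel<<<grid, block>>>(dev_in, dev_out, 10000, 10000);
```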
For accurate technical information on this topic, see Predication (computer architecture) § SIMD, SIMT and vector predication, and ILLIAC IV "branching" (the term "predicate mask" did not exist in 1967).
In sequential code it is possible to control the flow of the program using if-then-else statements and various forms of loops. Such flow control structures have only recently been added to GPUs.[48] Conditional writes could be performed using a properly crafted series of arithmetic/bit operations, but looping and conditional branching were not possible.
Recent[when?] GPUs allow branching, but usually with a performance penalty. Branching should generally be avoided in inner loops, whether in CPU or GPU code, and various methods, such as static branch resolution, pre-computation, predication, loop splitting,[49] and Z-cull[50] can be used to achieve branching when hardware support does not exist.
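As a hedged illustration of predication, a data-dependent branch can be replaced by arithmetic selection so that every thread executes the same instruction sequence (the threshold and function names are illustrative):

```cuda
// Branching version: threads in a warp may diverge on the condition.
__device__ float with_branch(float v, float threshold)
{
    if (v > threshold)
        return v * 2.0f;
    else
        return v * 0.5f;
}

// Predicated version: both candidate results are computed and one is
// selected arithmetically, so all threads follow the same path.
__device__ float with_predication(float v, float threshold)
{
    float mask = (v > threshold) ? 1.0f : 0.0f;   // 0/1 "predicate"
    return mask * (v * 2.0f) + (1.0f - mask) * (v * 0.5f);
}
```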
The map operation simply applies the given function (the kernel) to every element in the stream. A simple example is multiplying each value in the stream by a constant (increasing the brightness of an image). The map operation is simple to implement on the GPU. The programmer generates a fragment for each pixel on screen and applies a fragment program to each one. The result stream of the same size is stored in the output buffer.
Some computations require calculating a smaller stream (possibly a stream of only one element) from a larger stream. This is called a reduction of the stream. Generally, a reduction can be performed in multiple steps. The results from the prior step are used as the input for the current step and the range over which the operation is applied is reduced until only one stream element remains.
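A hedged sketch of one such multi-step reduction in CUDA shared memory (a textbook pattern, not any particular library's implementation): each block halves its active range step by step until one partial sum per block remains, and those partial sums are combined in a further pass.

```cuda
// Sum-reduce the elements assigned to one block into a single partial sum.
// Launch with shared memory size blockDim.x * sizeof(float); repeated
// launches (or a final CPU pass) combine the per-block partial sums.
__global__ void block_sum(const float *in, float *partial, int n)
{
    extern __shared__ float sdata[];                // one float per thread
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Halve the active range each step: the "multiple steps" of the reduction.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            sdata[tid] += sdata[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = sdata[0];
}
```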
Stream filtering is essentially a non-uniform reduction. Filtering involves removing items from the stream based on some criteria.
The scan operation, also termed parallel prefix sum, takes in a vector (stream) of data elements and an (arbitrary) associative binary function '+' with an identity element 'i'. If the input is [a0, a1, a2, a3, ...], an exclusive scan produces the output [i, a0, a0 + a1, a0 + a1 + a2, ...], while an inclusive scan produces the output [a0, a0 + a1, a0 + a1 + a2, a0 + a1 + a2 + a3, ...] and does not require an identity to exist. While at first glance the operation may seem inherently serial, efficient parallel scan algorithms are possible and have been implemented on graphics processing units. The scan operation has uses in e.g., quicksort and sparse matrix-vector multiplication.[46][51][52][53]
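For a concrete (hedged) illustration with '+' as ordinary addition: the input [3, 1, 7, 0, 4] has exclusive scan [0, 3, 4, 11, 11] and inclusive scan [3, 4, 11, 11, 15]. A sequential reference in C makes the two definitions explicit; parallel GPU implementations produce the same results using tree-based algorithms.

```c
// Sequential reference versions of exclusive and inclusive scan with '+'.
void exclusive_scan(const float *in, float *out, int n)
{
    float sum = 0.0f;                 // identity element for '+'
    for (int i = 0; i < n; i++) {
        out[i] = sum;                 // value *before* adding in[i]
        sum += in[i];
    }
}

void inclusive_scan(const float *in, float *out, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += in[i];
        out[i] = sum;                 // value *including* in[i]
    }
}
```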
The scatter operation is most naturally defined on the vertex processor. The vertex processor is able to adjust the position of the vertex, which allows the programmer to control where information is deposited on the grid. Other extensions are also possible, such as controlling how large an area the vertex affects.
The fragment processor cannot perform a direct scatter operation because the location of each fragment on the grid is fixed at the time of the fragment's creation and cannot be altered by the programmer. However, a logical scatter operation may sometimes be recast or implemented with another gather step. A scatter implementation would first emit both an output value and an output address. An immediately following gather operation uses address comparisons to see whether the output value maps to the current output slot.
In dedicated compute kernels, scatter can be performed by indexed writes.
Gather is the reverse of scatter. After scatter reorders elements according to a map, gather can restore the order of the elements according to the map scatter used. In dedicated compute kernels, gather may be performed by indexed reads. In other shaders, it is performed with texture-lookups.
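Hedged sketches of both patterns as dedicated compute kernels (the index map is assumed to be a caller-supplied permutation of 0..n-1):

```cuda
// Scatter: each thread writes its input element to a mapped location.
__global__ void scatter(const float *in, const int *map, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[map[i]] = in[i];          // indexed write
}

// Gather: each thread reads its input element from a mapped location.
__global__ void gather(const float *in, const int *map, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[map[i]];          // indexed read
}
```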
The sort operation transforms an unordered set of elements into an ordered set of elements. The most common implementation on GPUs is using radix sort for integer and floating point data and coarse-grained merge sort and fine-grained sorting networks for general comparable data.[54][55]
The search operation allows the programmer to find a given element within the stream, or possibly find neighbors of a specified element. Mostly the search method used is binary search on sorted elements.
A variety of data structures can be represented on the GPU:
The following are some of the areas where GPUs have been used for general purpose computing:
GPGPU usage in Bioinformatics:[71][96]
| Application | Description | Supported features | Expected speed-up† | GPU‡ | Multi-GPU support | Release status |
|---|---|---|---|---|---|---|
| BarraCUDA | DNA, including epigenetics, sequence mapping software[97] | Alignment of short sequencing reads | 6–10x | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 0.7.107f |
| CUDASW++ | Open source software for Smith-Waterman protein database searches on GPUs | Parallel search of Smith-Waterman database | 10–50x | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 2.0.8 |
| CUSHAW | Parallelized short read aligner | Parallel, accurate long read aligner – gapped alignments to large genomes | 10x | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 1.0.40 |
| GPU-BLAST | Local search with fast k-tuple heuristic | Protein alignment according to blastp, multi CPU threads | 3–4x | T 2075, 2090, K10, K20, K20X | Single only | Available now, version 2.2.26 |
| GPU-HMMER | Parallelized local and global search with profile hidden Markov models | Parallel local and global search of hidden Markov models | 60–100x | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 2.3.2 |
| mCUDA-MEME | Ultrafast scalable motif discovery algorithm based on MEME | Scalable motif discovery algorithm based on MEME | 4–10x | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 3.0.12 |
| SeqNFind | A GPU accelerated sequence analysis toolset | Reference assembly, blast, Smith–Waterman, hmm, de novo assembly | 400x | T 2075, 2090, K10, K20, K20X | Yes | Available now |
| UGENE | Open-source Smith–Waterman for SSE/CUDA, suffix array based repeats finder and dotplot | Fast short read alignment | 6–8x | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 1.11 |
| WideLM | Fits numerous linear models to a fixed design and response | Parallel linear regression on multiple similarly-shaped models | 150x | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 0.1-1 |
| Application | Description | Supported features | Expected speed-up† | GPU‡ | Multi-GPU support | Release status |
|---|---|---|---|---|---|---|
| Abalone | Models molecular dynamics of biopolymers for simulations of proteins, DNA and ligands | Explicit and implicit solvent, hybrid Monte Carlo | 4–120x | T 2075, 2090, K10, K20, K20X | Single only | Available now, version 1.8.88 |
| ACEMD | GPU simulation of molecular mechanics force fields, implicit and explicit solvent | Written for use on GPUs | 160 ns/day GPU version only | T 2075, 2090, K10, K20, K20X | Yes | Available now |
| AMBER | Suite of programs to simulate molecular dynamics on biomolecules | PMEMD: explicit and implicit solvent | 89.44 ns/day JAC NVE | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 12 + bugfix9 |
| DL-POLY | Simulate macromolecules, polymers, ionic systems, etc. on a distributed memory parallel computer | Two-body forces, link-cell pairs, Ewald SPME forces, Shake VV | 4x | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 4.0 source only |
| CHARMM | MD package to simulate molecular dynamics on biomolecules | Implicit (5x), explicit (2x) solvent via OpenMM | TBD | T 2075, 2090, K10, K20, K20X | Yes | In development Q4/12 |
| GROMACS | Simulate biochemical molecules with complex bond interactions | Implicit (5x), explicit (2x) solvent | 165 ns/Day DHFR | T 2075, 2090, K10, K20, K20X | Single only | Available now, version 4.6 in Q4/12 |
| HOOMD-Blue | Particle dynamics package written grounds up for GPUs | Written for GPUs | 2x | T 2075, 2090, K10, K20, K20X | Yes | Available now |
| LAMMPS | Classical molecular dynamics package | Lennard-Jones, Morse, Buckingham, CHARMM, tabulated, coarse grain SDK, anisotropic Gay-Berne, RE-squared, "hybrid" combinations | 3–18x | T 2075, 2090, K10, K20, K20X | Yes | Available now |
| NAMD | Designed for high-performance simulation of large molecular systems | 100M atom capable | 6.44 ns/days STMV 585x 2050s | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 2.9 |
| OpenMM | Library and application for molecular dynamics for HPC with GPUs | Implicit and explicit solvent, custom forces | Implicit: 127–213 ns/day; Explicit: 18–55 ns/day DHFR | T 2075, 2090, K10, K20, K20X | Yes | Available now, version 4.1.1 |
† Expected speedups are highly dependent on system configuration. GPU performance compared against multi-core x86 CPU socket. GPU performance benchmarked on GPU-supported features and may be a kernel-to-kernel performance comparison. For details on the configuration used, view the application website. Speedups as per Nvidia in-house testing or ISV's documentation.
‡ Q=Quadro GPU, T=Tesla GPU. Nvidia recommended GPUs for this application. Check with developer or ISV to obtain certification information.
Lowry is reportedly using Nvidia Tesla GPUs (graphics-processing units) programmed in the company's CUDA (Compute Unified Device Architecture) to implement the algorithms. Nvidia claims that the GPUs are approximately two orders of magnitude faster than CPU computations, reducing the processing time to less than one minute per frame.
accelerates signal integrity simulations on workstations that have Nvidia Compute Unified Device Architecture (CUDA)-based Graphics Processing Units (GPU)
During internal testing, the Tesla S1070 demonstrated a 360-fold increase in the speed of the similarity-defining algorithm when compared to the popular Intel Core 2 Duo central processor running at a clock speed of 2.6 GHz.