Non-uniform memory access

From Wikipedia, the free encyclopedia
Computer memory design used in multiprocessing
The motherboard of an HP Z820 workstation with two CPU sockets, each with its own set of eight DIMM slots surrounding the socket

Non-uniform memory access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor. Under NUMA, a processor can access its own local memory faster than non-local memory (memory local to another processor or memory shared between processors).[1] NUMA is beneficial for workloads with high memory locality of reference and low lock contention, because a processor may operate on a subset of memory mostly or entirely within its own cache node, reducing traffic on the memory bus.[2]

NUMA architectures logically follow in scaling from symmetric multiprocessing (SMP) architectures. They were developed commercially during the 1990s by Unisys, Convex Computer (later Hewlett-Packard), Honeywell Information Systems Italy (HISI) (later Groupe Bull), Silicon Graphics (later Silicon Graphics International), Sequent Computer Systems (later IBM), Data General (later EMC, now Dell Technologies), Digital (later Compaq, then HP, now HPE) and ICL. Techniques developed by these companies later featured in a variety of Unix-like operating systems, and to an extent in Windows NT.

The first commercial implementation of a NUMA-based Unix system was the Symmetrical Multi Processing XPS-100 family of servers, designed by Dan Gielan of VAST Corporation for Honeywell Information Systems Italy.

Overview

One possible architecture of a NUMA system. The processors connect to the bus or crossbar by connections of varying number. This shows that different CPUs have different access priorities to memory based on their relative location.

Modern CPUs operate considerably faster than the main memory they use. In the early days of computing and data processing, the CPU generally ran slower than its own memory. The performance lines of processors and memory crossed in the 1960s with the advent of the first supercomputers. Since then, CPUs increasingly have found themselves "starved for data" and forced to stall while waiting for data to arrive from memory (e.g. for von Neumann architecture-based computers, see Von Neumann bottleneck). Many supercomputer designs of the 1980s and 1990s focused on providing high-speed memory access as opposed to faster processors, allowing the computers to work on large data sets at speeds other systems could not approach.

Limiting the number of memory accesses provided the key to extracting high performance from a modern computer. For commodity processors, this meant installing an ever-increasing amount of high-speed cache memory and using increasingly sophisticated algorithms to avoid cache misses. But the dramatic increase in size of both the operating systems and the applications run on them has generally overwhelmed these cache-processing improvements. Multi-processor systems without NUMA make the problem considerably worse. Now a system can starve several processors at the same time, notably because only one processor can access the computer's memory at a time.[3]

NUMA attempts to address this problem by providing separate memory for each processor, avoiding the performance hit when several processors attempt to address the same memory. For problems involving spread data (common for servers and similar applications), NUMA can improve the performance over a single shared memory by a factor of roughly the number of processors (or separate memory banks).[4] Another approach to addressing this problem is the multi-channel memory architecture, in which a linear increase in the number of memory channels increases the memory access concurrency linearly.[5]

Of course, not all data ends up confined to a single task, which means that more than one processor may require the same data. To handle these cases, NUMA systems include additional hardware or software to move data between memory banks. This operation slows the processors attached to those banks, so the overall speed increase due to NUMA heavily depends on the nature of the running tasks.[4]
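
How much of this benefit a program actually sees depends on whether its data is placed on the node of the processor that uses it. The sketch below is a minimal illustration of node-local versus remote placement using Linux's libnuma library (assumptions: a Linux system with libnuma installed and at least two NUMA nodes; buffer size and node choice are arbitrary). It allocates one buffer on the calling thread's own node and one on another node, so writes to the first stay on the local memory bus while writes to the second cross the interconnect.

    /* Minimal sketch using Linux's libnuma; compile with -lnuma.
       Assumes a NUMA-capable kernel; node numbers and sizes are illustrative. */
    #define _GNU_SOURCE
    #include <numa.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        if (numa_available() < 0) {              /* no NUMA support on this system */
            fprintf(stderr, "NUMA not available\n");
            return 1;
        }

        int cpu = sched_getcpu();
        if (cpu < 0)
            return 1;
        int local_node  = numa_node_of_cpu(cpu);                 /* node we are running on */
        int remote_node = (local_node + 1) % (numa_max_node() + 1);
        size_t size = 64UL * 1024 * 1024;                        /* 64 MiB per buffer */

        /* Memory placed on our own node: accesses stay on the local memory bus. */
        char *local_buf  = numa_alloc_onnode(size, local_node);
        /* Memory placed on another node: every access crosses the interconnect. */
        char *remote_buf = numa_alloc_onnode(size, remote_node);
        if (!local_buf || !remote_buf)
            return 1;

        memset(local_buf, 1, size);    /* local traffic */
        memset(remote_buf, 1, size);   /* remote traffic, typically slower */

        numa_free(local_buf, size);
        numa_free(remote_buf, size);
        return 0;
    }

On a machine with a single node the two buffers land in the same place and the two memset calls behave identically; the difference only shows up with two or more nodes.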

Implementations


AMD implemented NUMA with its Opteron processor (2003), using HyperTransport. Intel announced NUMA compatibility for its x86 and Itanium servers in late 2007 with its Nehalem and Tukwila CPUs.[6] Both Intel CPU families share a common chipset; the interconnection is called Intel QuickPath Interconnect (QPI), which provides extremely high bandwidth to enable high on-board scalability and was replaced by a new version called Intel UltraPath Interconnect with the release of Skylake (2017).[7]

Cache coherent NUMA (ccNUMA)

Topology of a ccNUMA Bulldozer server, extracted using hwloc's lstopo tool
Further information: Directory-based cache coherence

Nearly all CPU architectures use a small amount of very fast non-shared memory known as cache to exploit locality of reference in memory accesses. With NUMA, maintaining cache coherence across shared memory has a significant overhead. Although simpler to design and build, non-cache-coherent NUMA systems become prohibitively complex to program in the standard von Neumann architecture programming model.[8]

Typically, ccNUMA uses inter-processor communication between cache controllers to keep a consistent memory image when more than one cache stores the same memory location. For this reason, ccNUMA may perform poorly when multiple processors attempt to access the same memory area in rapid succession. Support for NUMA in operating systems attempts to reduce the frequency of this kind of access by allocating processors and memory in NUMA-friendly ways and by avoiding scheduling and locking algorithms that make NUMA-unfriendly accesses necessary.[9]
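
A common NUMA-friendly pattern that operating systems expose to applications combines CPU affinity with "first touch" placement: on Linux, a page is normally allocated on the node of the CPU that first writes to it, so a thread pinned to one node that also initializes its own working set ends up with node-local data. The sketch below is a simplified illustration of this pattern (assumptions: Linux with glibc and libnuma; the helper pin_to_node is written here for illustration and is not a library function).

    /* Sketch of the "pin, then first-touch" pattern; compile with -lnuma -lpthread. */
    #define _GNU_SOURCE
    #include <numa.h>
    #include <pthread.h>
    #include <sched.h>
    #include <stdlib.h>
    #include <string.h>

    /* Pin the calling thread to all CPUs of the given NUMA node. */
    static int pin_to_node(int node) {
        struct bitmask *cpus = numa_allocate_cpumask();
        if (numa_node_to_cpus(node, cpus) != 0)
            return -1;
        cpu_set_t set;
        CPU_ZERO(&set);
        for (unsigned int i = 0; i < cpus->size; i++)
            if (numa_bitmask_isbitset(cpus, i))
                CPU_SET(i, &set);
        numa_free_cpumask(cpus);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main(void) {
        if (numa_available() < 0 || pin_to_node(0) != 0)    /* node 0 chosen arbitrarily */
            return 1;

        /* First touch: this thread now runs on node 0, so the kernel's default
           policy places the pages written below on node 0 as well. */
        size_t size = 16UL * 1024 * 1024;
        char *buf = malloc(size);
        if (!buf)
            return 1;
        memset(buf, 0, size);

        /* ... later accesses by this thread are node-local ... */
        free(buf);
        return 0;
    }

Parallel runtimes rely on the same effect when each worker thread initializes the slice of a shared array that it will later process.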

Alternatively, cache coherency protocols such as the MESIF protocol attempt to reduce the communication required to maintain cache coherency. Scalable Coherent Interface (SCI) is an IEEE standard defining a directory-based cache coherency protocol to avoid scalability limitations found in earlier multiprocessor systems. For example, SCI is used as the basis for the NumaConnect technology.[10][11]

NUMA vs. cluster computing


One can view NUMA as a tightly coupled form of cluster computing. The addition of virtual memory paging to a cluster architecture can allow the implementation of NUMA entirely in software. However, the inter-node latency of software-based NUMA remains several orders of magnitude greater (slower) than that of hardware-based NUMA.[2]

Software support


Since NUMA largely influences memory access performance, certain software optimizations are needed to schedule threads and processes close to their in-memory data.

  • Microsoft Windows 7 and Windows Server 2008 R2 added support for NUMA architecture over 64 logical cores.[12]
  • Java 7 added support for a NUMA-aware memory allocator and garbage collector.[13]
  • Linux kernel:
    • Version 2.5 provided basic NUMA support,[14] which was further improved in subsequent kernel releases.
    • Version 3.8 of the Linux kernel brought a new NUMA foundation that allowed development of more efficient NUMA policies in later kernel releases.[15][16]
    • Version 3.13 of the Linux kernel brought numerous policies that aim at putting a process near its memory, together with the handling of cases such as having memory pages shared between processes, or the use of transparent huge pages; new sysctl settings allow NUMA balancing to be enabled or disabled, as well as the configuration of various NUMA memory balancing parameters (see the sketch after this list for one way to inspect the resulting page placement from user space).[17][18][19]
  • OpenSolaris models NUMA architecture with lgroups.
  • FreeBSD added support for NUMA architecture in version 9.0.[20]
  • Silicon Graphics IRIX (discontinued as of 2013) supported ccNUMA architecture over 1024 CPUs with the Origin server series.
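
On Linux, the placement that results from these policies can be inspected from user space with the move_pages(2) system call: when it is passed a NULL node list it moves nothing and instead reports the node on which each page currently resides. The sketch below shows this query mode (assumptions: a Linux system with libnuma installed, which provides the move_pages wrapper declared in <numaif.h>; the buffer size is arbitrary).

    /* Sketch: ask the kernel which NUMA node a page lives on; compile with -lnuma. */
    #include <numaif.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        long page_size = sysconf(_SC_PAGESIZE);
        char *buf = aligned_alloc(page_size, page_size);
        if (!buf)
            return 1;
        memset(buf, 1, page_size);        /* touch the page so it is actually allocated */

        void *pages[1] = { buf };
        int status[1];
        /* A NULL node list turns move_pages into a query: on success, status[0]
           holds the node number (or a negative errno value) for the page. */
        if (move_pages(0 /* calling process */, 1, pages, NULL, status, 0) != 0) {
            perror("move_pages");
            return 1;
        }
        printf("page at %p is on NUMA node %d\n", (void *)buf, status[0]);

        free(buf);
        return 0;
    }

The numactl and numastat command-line tools expose similar topology and placement information without writing any code.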

Hardware support


As of 2011, ccNUMA systems are multiprocessor systems based on the AMD Opteron processor, which can be implemented without external logic, and the Intel Itanium processor, which requires the chipset to support NUMA. Examples of ccNUMA-enabled chipsets are the SGI Shub (Super hub), the Intel E8870, the HP sx2000 (used in the Integrity and Superdome servers), and those found in NEC Itanium-based systems. Earlier ccNUMA systems such as those from Silicon Graphics were based on MIPS processors and the DEC Alpha 21364 (EV7) processor.

References

  1. ^ This article is based on material taken from Non-uniform+memory+access at the Free On-line Dictionary of Computing prior to 1 November 2008 and incorporated under the "relicensing" terms of the GFDL, version 1.3 or later.
  2. ^ a b Manchanda, Nakul; Anand, Karan (2010-05-04). "Non-Uniform Memory Access (NUMA)" (PDF). New York University. Archived from the original (PDF) on 2013-12-28. Retrieved 2014-01-27.
  3. ^ Sergey Blagodurov; Sergey Zhuravlev; Mohammad Dashti; Alexandra Fedorova (2011-05-02). "A Case for NUMA-aware Contention Management on Multicore Systems" (PDF). Simon Fraser University. Retrieved 2014-01-27.
  4. ^ a b Zoltan Majo; Thomas R. Gross (2011). "Memory System Performance in a NUMA Multicore Multiprocessor" (PDF). ACM. Archived from the original (PDF) on 2013-06-12. Retrieved 2014-01-27.
  5. ^ "Intel Dual-Channel DDR Memory Architecture White Paper" (PDF) (Rev. 1.0 ed.). Infineon Technologies North America and Kingston Technology. September 2003. Archived from the original (PDF, 1021 KB) on 2011-09-29. Retrieved 2007-09-06.
  6. ^ Intel Corp. (2008). Intel QuickPath Architecture [White paper]. Retrieved from http://www.intel.com/pressroom/archive/reference/whitepaper_QuickPath.pdf
  7. ^ "Gelsinger Speaks To Intel And High-Tech Industry's Rapid Technology Cadence" (Press release). Intel Corporation. September 18, 2007. Retrieved March 29, 2025.
  8. ^ "ccNUMA: Cache Coherent Non-Uniform Memory Access". slideshare.net. 2014. Retrieved 2014-01-27.
  9. ^ Per Stenström; Truman Joe; Anoop Gupta (2002). "Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures" (PDF). ACM. Retrieved 2014-01-27.
  10. ^ Gustavson, David B. (September 1991). "The Scalable Coherent Interface and Related Standards Projects" (PDF). SLAC Publication 5656. Stanford Linear Accelerator Center. Archived (PDF) from the original on 2022-10-09. Retrieved January 27, 2014.
  11. ^ "The NumaChip enables cache coherent low cost shared memory". Numascale.com. Archived from the original on 2014-01-22. Retrieved 2014-01-27.
  12. ^NUMA Support (MSDN)
  13. ^Java HotSpot Virtual Machine Performance Enhancements
  14. ^"Linux Scalability Effort: NUMA Group Homepage".SourceForge.net. 2002-11-20. Retrieved2014-02-06.
  15. ^"Linux kernel 3.8, Section 1.8. Automatic NUMA balancing".kernelnewbies.org. 2013-02-08. Retrieved2014-02-06.
  16. ^Jonathan Corbet (2012-11-14)."NUMA in a hurry".LWN.net. Retrieved2014-02-06.
  17. ^"Linux kernel 3.13, Section 1.6. Improved performance in NUMA systems".kernelnewbies.org. 2014-01-19. Retrieved2014-02-06.
  18. ^"Linux kernel documentation: Documentation/sysctl/kernel.txt".kernel.org. Retrieved2014-02-06.
  19. ^Jonathan Corbet (2013-10-01)."NUMA scheduling progress".LWN.net. Retrieved2014-02-06.
  20. ^"numa(4)".www.freebsd.org. Retrieved2020-12-03.
