Scaling Linux to New Heights: the SGI Altix 3000 System
SGI recently debuted its new 64-bit, 64-processor Linux system based on the Intel Itanium 2 processor—a significant announcement for the company and for Linux. This system marks the opening of a new frontier, as scientists working on complex and demanding high-performance computing (HPC) problems can now use and deploy Linux in ways never before possible. HPC environments continually push the limits of the operating system by requiring larger numbers of CPUs, higher I/O bandwidth and faster, more efficient parallel programming support.
Early in the system's development, SGI decided to use Linux exclusively as the operating system for this new platform. Linux proved to be a solid and very capable operating system for the technical compute environments that SGI targets. With the combination of the SGI NUMAflex global shared-memory architecture, Intel Itanium 2 processors and Linux, we were breaking records long before the system was introduced.
The new system, called the SGI Altix 3000, has up to 64 processors and 512GB of memory. A future version will offer up to 512 processors and 4TB of memory. In this article, we explore the hardware design behind the new SGI system, describe the software development involved in bringing this new system to market and show how Linux can readily scale and be deployed in the most demanding HPC environments.
The SGI Altix 3000 system uses Intel Itanium 2 processors and is based on the SGI NUMAflex global shared-memory architecture, the company's implementation of a non-uniform memory access (NUMA) architecture. NUMAflex was introduced in 1996 and has since been used in the company's renowned SGI Origin family of servers and supercomputers based on the MIPS processor and the IRIX 64-bit operating system. The NUMAflex design enables the CPU, memory, I/O, interconnect, graphics and storage to be packaged into modular components, or bricks. These bricks can then be combined and configured with tremendous flexibility to better match a customer's resource and workload requirements. Leveraging this third-generation design, SGI was able to build the SGI Altix 3000 system using the same bricks for I/O (IX- and PX-bricks), storage (D-bricks) and interconnect (router bricks, or R-bricks). The primary difference in this new system is the CPU brick (C-brick), which contains the Itanium 2 processors. Figure 1 shows the types of bricks used on the SGI Altix 3000 system. Figure 2 depicts how these bricks can be combined into two racks to make a single-system-image 64-processor system.
Figure 1. Brick Types Used in the SGI Altix 3000 System
Figure 2. Two Possible NUMAflex Configurations
On a well-designed and balanced hardware architecture such as NUMAflex, it is the operating system's job to ensure that users and applications can fully exploit the hardware without being hindered by inefficient resource management or bottlenecks. Achieving balanced hardware resource management on a large NUMA system requires starting kernel development long before the first Itanium 2 processors and hardware prototype systems arrive. In this case, we also used first-generation Itanium processors to make the CPU scaling, I/O performance and other changes to Linux necessary for demanding HPC environments.
The first step in preparing the software before the prototype hardware arrives is identifying, as best you can, the low-level hardware register and machine programming changes the kernel will need for system initialization and runtime. System manufacturers developing custom ASICs for highly advanced systems typically use simulation software and tools to test their hardware design. Before hardware was available, we developed and used simulators extensively for both system firmware and kernel development to get the system-level software ready.
When the original prototype hardware based on first-generation Itanium processors arrived, it was time for power-on. One of the key milestones was powering the system on for the first time and taking a processor out of reset, then fetching and executing the first instructions from PROM.

Figure 3. SGI engineers celebrate power-on success.
After power-on, the fun really began, with long hours and weekends in the hardware “bring-up” lab. This is where hardware, diagnostic and platform-software engineers worked together closely to debug the system and get the processor through a series of important milestones: PROM to boot prompt, Linux kernel through initialization, reading and mounting root, reaching single-user mode, going into multi-user mode and finally connecting to the network. After that, we did the same thing all over again with multiple processors and multiple nodes—typically pursued in parallel by several other bring-up teams at other stations trailing closely behind the lead team's progress.

Figure 4. During bring-up, a hardware engineer, a PROM engineer and an OS engineer discuss a bug.
Once we had Linux running on the prototype systems with first-generation Itanium processors, software engineers could proceed with ensuring that Linux ran and, in particular, scaled well on large NUMA systems. We built and used numerous in-house, first-generation Itanium-based systems to help ensure that Linux performed well on large systems. By early 2001, we had succeeded in running a 32-processor Itanium-based system—the first of its kind.

Figure 5. The author's son in front of an early 32-processor Itanium-based system, Summer 2001.
These first-generation Itanium-based systems were key to getting Linux ready for demanding HPC requirements. Well before the first Itanium 2 processors were available from Intel, the bulk of the scaling, I/O performance and other changes for Linux could be developed and tested.
As one group of SGI software engineers was busy working on performance, scaling and other issues using prototypes with first-generation Itanium processors, another team of hardware and platform-software engineers was getting the next-generation SGI C-brick with Itanium 2 processors ready for power-on, to repeat the bring-up process all over again.

Figure 6. First power-on of the Itanium 2-based C-brick.
By mid-2002, the bring-up team had made excellent progress, from power-on of a single processor to running a 64-processor system. The 64-processor system with Itanium 2 processors again marked the first of its kind. All this, of course, was with Linux running in a single-system image.
Throughout this whole process, we passed any Linux changes or bug fixes back to the kernel developers for inclusion in future releases of Linux.
Other Linux developers often ask, “What kind of changes did you have to make to get Linux to run on a system that size?” or “Isn't Linux CPU scaling limited to eight or so processors?” Answering these questions involves examining what SGI uses as its software base, the excellent changes made by the community and the other HPC-related enhancements and tools provided by SGI to help make Linux scale far beyond the perceived limit of eight processors.
On the SGI Altix 3000 system, the system software consists of a standard Linux distribution for Itanium processors and SGI ProPack, an overlay product that provides additional features for Linux. SGI ProPack includes a newer 2.4-based Linux kernel, HPC libraries highly tuned to exploit SGI's hardware, NUMA tools and drivers.
The 2.4-based Linux kernel used on the SGI Altix 3000 system consists of the standard 2.4.19 kernel for Itanium processors (kernel.org), plus other improvements. These improvements fall into one of three categories: general bug fixes and platform support, improvements from other work occurring within the Linux community and SGI changes.
The first category of kernel changes is simply ongoing fixes for bugs found during testing and continued improvements to the underlying platform and NUMA support. For these changes, SGI works with each area's designated kernel maintainer to get them incorporated back into the mainline kernel.
The second category of kernel improvements consists of the excellent work and performance patches developed by others within the community that have not yet been accepted officially or were deferred until the 2.5 development stream. These improvements can be found on the following VA Software SourceForge sites: the “Linux on Large Systems Foundry” (large.foundries.sourceforge.net) and the “Linux Scalability Effort Project” (sourceforge.net/projects/lse). We used the following patches from these projects: the CPU scheduler, Big Kernel Lock usage-reduction improvements, dcache_lock usage-reduction improvements based on the Read-Copy-Update spinlock paradigm and xtime_lock (gettimeofday) usage-reduction improvements based on the FRlock locking paradigm.
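The appeal of an FRlock-style scheme is that time-of-day readers never take a lock at all; they retry only if a writer was active during their read. The same basic sequence-count pattern later entered the mainline kernel as the seqlock. What follows is a minimal user-space sketch of that idea, not the actual patch: names such as xtime_seq are illustrative, and a single writer is assumed.

/*
 * A minimal sketch of sequence-count locking in the style of FRlock/
 * seqlock, applied to a time-of-day value. Illustrative only; assumes
 * a single writer (writers would otherwise need their own lock).
 */
#include <stdatomic.h>

struct timeval_s { long tv_sec; long tv_usec; };

static atomic_uint xtime_seq;    /* even = stable, odd = writer active */
static struct timeval_s xtime;   /* the protected time-of-day value */

/* Writer: make the count odd, update, make it even again. */
void xtime_update(long sec, long usec)
{
    atomic_fetch_add(&xtime_seq, 1);
    xtime.tv_sec = sec;
    xtime.tv_usec = usec;
    atomic_fetch_add(&xtime_seq, 1);
}

/* Reader: lock-free; retry if a writer was active during the copy. */
struct timeval_s xtime_read(void)
{
    struct timeval_s tv;
    unsigned int seq;

    do {
        seq = atomic_load(&xtime_seq);
        tv = xtime;
    } while ((seq & 1) || seq != atomic_load(&xtime_seq));
    return tv;
}

Because gettimeofday is read far more often than the clock is updated, eliminating the read-side lock removes a major source of contention on large systems.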
We also configured and used the Linux device filesystem (devfs, www.atnf.csiro.au/people/rgooch/linux/docs/devfs.html) on our systems to handle large numbers of disks and I/O busses. Devfs ensures that device path names persist across reboots after other disks or controllers are added or removed. The last thing a system administrator of a very large system wants is to have a controller go bad and have some 50 or more disks suddenly renumbered and renamed. We have found devfs to be reliable and stable in high-stress system environments with configurations consisting of up to 64 processors and dozens of Fibre Channel loops with hundreds of disks attached. Devfs is an optional part of the 2.4 Linux kernel, so a separate kernel patch was not needed.
The third category of kernel changes consists of improvements by SGI that are still in the process of being submitted to mainline Linux, were accepted after 2.4 or will probably remain separate due to the specialized use or nature of the patch. These open-source improvements can be found at the “Open Source at SGI” web site (oss.sgi.com). The improvements we made include the XFS filesystem software, Process AGGregates (PAGG), CpuMemSets (CMS), the kernel debugger (kdb) and the Linux kernel crash dump facility (lkcd).
In addition, SGI included its SCSI subsystem and drivers ported from IRIX. Early tests of the Linux 2.4 SCSI I/O subsystem showed that our customers' demanding storage needs could not be met without a major overhaul in this area. While mainstream kernel developers are working on this for a future release, SGI needed an immediate fix for its 2.4-based kernel, so the SGI XSCSI infrastructure and drivers from IRIX were used as an interim solution.
Figures 7-9 illustrate some of the early performance improvements achieved with Linux on the SGI Altix 3000 system using the previously described changes. Figure 7 compares XFS to other Linux filesystems. (For a more detailed study of Linux filesystem performance, see “Filesystem Performance and Scalability in Linux 2.4.17”, 2002 USENIX Annual Technical Conference, also available at oss.sgi.com.) Figure 8 compares XSCSI to SCSI in Linux 2.4, and Figure 9 shows CPU scalability using AIM7.

Figure 7. Filesystem performance comparison: AIM7 multi-user kernel workload, 2.4.17 kernel; 28-processor Itanium prototype, 14GB of memory, 120 disks; work-in-progress, interim example; only the filesystem was varied, but the configuration includes SGI enhancements and an SGI-tuned kernel.

Figure 8. Linux XSCSI performance example: work-in-progress, interim example using the 2.4.16 kernel; 120 processes reading from 120 disks (through the driver only).

Figure 9. CPU scaling example with AIM7: AIM7 multi-user kernel workload, 2.4.16 kernel; work-in-progress, interim example; SGI enhancements and SGI-tuned kernel.
While SGI is focused more on high-performance and technical computing environments—where the majority of CPU cycles are typically spent in user-level code and applications rather than in the kernel—the AIM7 benchmark does show that Linux can still scale well with other types of workloads common in enterprise environments. For HPC application performance and scaling examples for Linux, see the Sidebar “Already Solving Real-World Problems”.
Figure 10 shows the scaling results achieved on an early SGI 64-processor prototype system with Itanium 2 processors running the STREAM Triad benchmark, which tests memory bandwidth. With this benchmark, SGI demonstrated near-linear scalability from two to 64 processors and achieved over 120GB per second. This result marks a significant milestone for the industry, setting a new world record for microprocessor-based systems—achieved running Linux within a single-system image. It also demonstrates that Linux can indeed scale well beyond the perceived limitation of eight processors. For more information on STREAM Triad, see www.cs.virginia.edu/stream.
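The Triad kernel itself is tiny; the measured bandwidth comes from how fast the machine can stream three large arrays through memory. Below is a minimal OpenMP sketch of the Triad loop. The official benchmark at www.cs.virginia.edu/stream adds timing, repeated trials and result verification, and the array size here is illustrative; it must exceed the combined cache sizes of the machine under test.

/*
 * A minimal OpenMP sketch of the STREAM Triad kernel:
 * a[i] = b[i] + q * c[i]. Compile with an OpenMP-capable compiler,
 * e.g. cc -fopenmp stream_triad.c.
 */
#include <stdio.h>
#include <stdlib.h>

#define N 20000000L   /* illustrative; must defeat the caches */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    const double q = 3.0;

    if (!a || !b || !c)
        return 1;

    /* Parallel first-touch initialization: on a NUMA system such as
     * the Altix, this places each page near the CPU that uses it. */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) {
        a[i] = 0.0;
        b[i] = 1.0;
        c[i] = 2.0;
    }

    /* The Triad: two loads and one store per iteration, all streaming. */
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + q * c[i];

    printf("a[0] = %f\n", a[0]);  /* use the result so it isn't optimized away */
    free(a); free(b); free(c);
    return 0;
}

Note the parallel initialization loop: with a first-touch page-placement policy, initializing memory on the same CPUs that later compute with it is what lets bandwidth scale with processor count on a NUMA machine.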
The list of kernel additions included in SGI ProPack is actually surprisingly small, which speaks highly of Linux's robust original design. What is even more impressive is that many of these and other changes are already in the 2.5 development kernel. At this pace, Linux is quickly evolving into a serious HPC operating system.
SGI ProPack also includes several tools and libraries that help improve performance on large NUMA systems, whether solving a complex problem with an application that needs large numbers of CPUs and memory or running multiple applications simultaneously within the same large system. On Linux, SGI provides the commands cpuset and dplace, which give predictable and improved CPU and memory placement control for HPC applications. These tools help unrelated jobs carve out and use the resources they each need without getting in each other's way, and they prevent a smaller job from inadvertently thrashing across a larger pool of resources than it can effectively use. As a result, system resources are used efficiently and deliver results in a consistent time period—two characteristics critical to HPC environments.
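The cpuset and dplace commands are SGI's tools, documented with SGI ProPack. Purely to illustrate the underlying idea of pinning work to a chosen CPU, here is a sketch using sched_setaffinity(), the generic Linux affinity call that entered the mainline kernel after the 2.4 era described here; this shows the concept, not how dplace is implemented.

/*
 * A rough illustration of CPU placement, the idea behind dplace and
 * cpuset: pin the calling process to one CPU so the scheduler cannot
 * migrate it away from its memory.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(2, &mask);   /* pin this process to CPU 2 (illustrative) */

    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    printf("pinned; now running on CPU %d\n", sched_getcpu());
    return 0;
}

Keeping a process on the CPUs near its memory is what makes run times predictable on a NUMA system; without placement, a job migrated to a distant node pays remote-memory latency on every access.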
The SGI Message Passing Toolkit (MPT) in SGI ProPack also provides industry-standard message-passing libraries optimized for SGI computers. MPT contains MPI and SHMEM APIs, which transparently exploit the low-level capabilities of the SGI hardware, such as its block transfer engine (BTE) for fast memory-to-memory transfers and the hardware memory controller's fetch operation (fetchop) support. Fetchop support enables direct communication and synchronization between multiple MPI processes while eliminating the overhead associated with system calls to the operating system.
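The code MPT accelerates is ordinary, portable MPI; the library maps transfers onto hardware features such as the BTE and fetchops underneath. A minimal standard MPI example of the kind of message passing involved:

/*
 * A minimal standard MPI example: rank 0 sends an integer to rank 1.
 * Under SGI MPT, such transfers can use the hardware BTE and fetchop
 * support; the source code itself is unchanged, portable MPI.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

Because the API is standard, applications written this way run unmodified on other MPI implementations; the hardware-specific optimization happens entirely inside the library.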
The SGI ProPack NUMA tools, HPC libraries and additional software support layered on top of a standard Linux distribution provide a powerful HPC software environment for big compute- and data-intensive workloads. Much as a custom ASIC provides the “glue logic” to leverage commodity processors, memory and I/O parts, SGI ProPack software provides the “glue logic” to leverage the Linux operating system as a commodity building block for large HPC environments.
No one believed Linux could scale so well, so soon. By combining Linux with the SGI NUMAflex system architecture and Itanium 2 processors, SGI has built the world's most powerful Linux system. Bringing the SGI Altix 3000 system to market involved a tremendous amount of work, and we consider it to be only the beginning. SGI's aggressive standards-based strategy for using Linux on Itanium 2-based systems is raising the bar on what Linux can do while providing customers an exciting, no-compromises alternative for large HPC servers and supercomputers. SGI engineers—and the entire company, for that matter—are fully committed to building on Linux's capabilities and pushing the envelope even further to bring more exciting breakthroughs and opportunities to the Linux and HPC communities.

Steve Neuner has been working in UNIX kernel development for the past 19 years at major computer manufacturers including MAI Basic Four, Sequent Computer Systems, Digital Equipment Corporation and SGI. Now with SGI, Steve is the Linux engineering director and has been working on Linux and Itanium-based systems since joining SGI four years ago.


