A supercomputer operating system is an operating system intended for supercomputers. Since the end of the 20th century, supercomputer operating systems have undergone major transformations, as fundamental changes have occurred in supercomputer architecture.[1] While early operating systems were custom tailored to each supercomputer to gain speed, the trend has been moving away from in-house operating systems and toward some form of Linux,[2] which ran on every supercomputer on the TOP500 list in November 2017. In 2021, the top 10 computers ran, for instance, Red Hat Enterprise Linux (RHEL), a variant of it, or another Linux distribution such as Ubuntu.
Given that modern massively parallel supercomputers typically separate computations from other services by using multiple types of nodes, they usually run different operating systems on different nodes, e.g., using a small and efficient lightweight kernel such as Compute Node Kernel (CNK) or Compute Node Linux (CNL) on compute nodes, but a larger system such as a Linux distribution on server and input/output (I/O) nodes.[3][4]
While in a traditional multi-user computer system job scheduling is in effect a tasking problem for processing and peripheral resources, in a massively parallel system the job management system needs to manage the allocation of both computational and communication resources, and to deal gracefully with the inevitable hardware failures that occur when tens of thousands of processors are present.[5]
Although most modern supercomputers use the Linux operating system,[6] each manufacturer has made its own specific changes to the Linux distribution it uses, and no industry standard exists, partly because differences in hardware architectures require changes that optimize the operating system for each hardware design.[1][7]

In the early days of supercomputing, the basic architectural concepts were evolving rapidly, and system software had to follow hardware innovations that often changed direction abruptly.[1] In the early systems, operating systems were custom tailored to each supercomputer to gain speed, yet in the rush to develop them, serious software quality challenges surfaced, and in many cases the cost and complexity of system software development became as much an issue as that of hardware.[1]

In the 1980s the cost of software development at Cray came to equal what the company spent on hardware, and that trend was partly responsible for a move away from in-house operating systems to the adaptation of generic software.[2] The first wave of operating system changes came in the mid-1980s, as vendor-specific operating systems were abandoned in favor of Unix. Despite early skepticism, this transition proved successful.[1][2]
By the early 1990s, major changes were occurring in supercomputing system software.[1] By this time, the growing use of Unix had begun to change the way system software was viewed. The use of a high-level language (C) to implement the operating system, and the reliance on standardized interfaces, were in contrast to the assembly-language-oriented approaches of the past.[1] As hardware vendors adapted Unix to their systems, new and useful features were added to Unix, e.g., fast file systems and tunable process schedulers.[1] However, all the companies that adapted Unix made unique changes to it, rather than collaborating on an industry standard to create "Unix for supercomputers". This was partly because differences in their architectures required these changes to optimize Unix for each architecture.[1]
As general-purpose operating systems became stable, supercomputers began to borrow and adapt critical system code from them, and relied on the rich set of secondary functions that came with them.[1] However, at the same time the size of the code for general-purpose operating systems was growing rapidly. By the time Unix-based code had reached 500,000 lines, its maintenance and use were a challenge.[1] This resulted in the move to microkernels, which implement only a minimal set of operating system functions. Systems such as Mach at Carnegie Mellon University and ChorusOS at INRIA were examples of early microkernels.[1]
The separation of the operating system into separate components became necessary as supercomputers developed different types of nodes, e.g., compute nodes versus I/O nodes. Thus modern supercomputers usually run different operating systems on different nodes, e.g., using a small and efficient lightweight kernel such as CNK or CNL on compute nodes, but a larger system such as a Linux derivative on server and I/O nodes.[3][4]

The CDC 6600, generally considered the first supercomputer in the world, ran the Chippewa Operating System, which was then deployed on various other CDC 6000 series computers.[9] The Chippewa was a rather simple job-control-oriented system derived from the earlier CDC 3000, but it influenced the later KRONOS and SCOPE systems.[9][10]
The first Cray-1 was delivered to the Los Alamos Lab with no operating system or any other software.[11] Los Alamos developed both the application software and the operating system for it.[11] The main timesharing system for the Cray-1, the Cray Time Sharing System (CTSS), was then developed at the Livermore Labs as a direct descendant of the Livermore Time Sharing System (LTSS) for the CDC 6600 operating system from twenty years earlier.[11]
In developing supercomputers, rising software costs soon became dominant, as evidenced by the cost of software development at Cray growing in the 1980s to equal its cost for hardware.[2] That trend was partly responsible for a move away from the in-house Cray Operating System to the UNICOS system based on Unix.[2] In 1985, the Cray-2 was the first system to ship with the UNICOS operating system.[12]
Around the same time, the EOS operating system was developed by ETA Systems for use in their ETA10 supercomputers.[13] Written in Cybil, a Pascal-like language from Control Data Corporation, EOS highlighted the difficulty of developing stable operating systems for supercomputers, and eventually a Unix-like system was offered on the same machine.[13][14] The lessons learned from developing ETA system software included the high level of risk associated with developing a new supercomputer operating system, and the advantages of using Unix with its large extant base of system software libraries.[13]
By the mid-1990s, despite the extant investment in older operating systems, the trend was toward the use of Unix-based systems, which also facilitated the use of interactive graphical user interfaces (GUIs) for scientific computing across multiple platforms.[15] The move toward a commodity OS had opponents, who cited the fast pace and focus of Linux development as a major obstacle to adoption.[16] As one author wrote, "Linux will likely catch up, but we have large-scale systems now". Nevertheless, that trend continued to gain momentum, and by 2005 virtually all supercomputers used some Unix-like OS.[17] These variants of Unix included IBM AIX, the open source Linux system, and other adaptations such as UNICOS from Cray.[17] By the end of the 20th century, Linux was estimated to command the highest share of the supercomputing pie.[1][18]

The IBM Blue Gene supercomputer uses the CNK operating system on the compute nodes, but uses a modified Linux-based kernel called I/O Node Kernel (INK) on the I/O nodes.[3][19] CNK is a lightweight kernel that runs on each node and supports a single application running for a single user on that node. For the sake of efficient operation, the design of CNK was kept simple and minimal, with physical memory being statically mapped and the CNK neither needing nor providing scheduling or context switching.[3] CNK does not even implement file I/O on the compute node, but delegates that to dedicated I/O nodes.[19] However, given that on the Blue Gene multiple compute nodes share a single I/O node, the I/O node operating system does require multi-tasking, hence the selection of the Linux-based operating system.[3][19]
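The delegation of file I/O described above can be illustrated with a small sketch. This is not Blue Gene code: the class names `IONode` and `ComputeNode`, the request tuple format, and the in-memory dictionary standing in for a file system are all assumptions made for illustration. It only shows the general idea that compute nodes forward I/O requests and that one I/O node serves several of them.

```python
class IONode:
    """Hypothetical I/O node: the only place where file operations are carried out."""

    def __init__(self):
        self.files = {}   # in-memory stand-in for a real file system (assumption)
        self.served = 0   # number of requests handled for all attached compute nodes

    def handle(self, request):
        """Execute one forwarded request of the form (operation, path, data)."""
        op, path, data = request
        self.served += 1
        if op == "write":
            self.files[path] = self.files.get(path, b"") + data
            return len(data)
        if op == "read":
            return self.files.get(path, b"")
        raise ValueError(f"unsupported operation: {op}")


class ComputeNode:
    """Hypothetical compute node: has no file system and forwards every I/O call."""

    def __init__(self, rank, io_node):
        self.rank = rank
        self.io_node = io_node

    def write(self, path, data):
        # Nothing is written locally; the request is shipped to the I/O node.
        return self.io_node.handle(("write", path, data))

    def read(self, path):
        return self.io_node.handle(("read", path, b""))


if __name__ == "__main__":
    io_node = IONode()
    # Several compute nodes share a single I/O node, as in the text above.
    compute_nodes = [ComputeNode(rank, io_node) for rank in range(4)]
    for node in compute_nodes:
        node.write("results.log", f"rank {node.rank} done\n".encode())
    print(compute_nodes[0].read("results.log").decode())
```

Because every compute node's requests arrive at the same I/O node, that node has to interleave work for many clients, which is the reason the text gives for running a multi-tasking operating system there.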
While in traditional multi-user computer systems and early supercomputers, job scheduling was in effect a task scheduling problem for processing and peripheral resources, in a massively parallel system, the job management system needs to manage the allocation of both computational and communication resources.[5] It is essential to tune task scheduling, and the operating system, in different configurations of a supercomputer. A typical parallel job scheduler has a master scheduler which instructs some number of slave schedulers to launch, monitor, and control parallel jobs, and periodically receives reports from them about the status of job progress.[5]
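The master and subordinate scheduler arrangement can be sketched as a small single-process simulation. The names `Job`, `NodeScheduler`, and `MasterScheduler`, the use of abstract "steps" as a measure of job length, and the first-come, first-served queue are all assumptions for illustration; a real job management system would also handle communication resources and hardware failures, which this sketch omits.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    name: str
    steps: int  # remaining units of work (hypothetical measure of job length)


class NodeScheduler:
    """Subordinate scheduler: launches one job at a time and reports its status."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.current: Optional[Job] = None

    def launch(self, job):
        self.current = job

    def tick(self):
        """Advance the running job by one unit of work."""
        if self.current is not None and self.current.steps > 0:
            self.current.steps -= 1

    def report(self):
        """Status report returned to the master: (node, job name, state)."""
        if self.current is None:
            return (self.node_id, None, "idle")
        state = "done" if self.current.steps == 0 else "running"
        return (self.node_id, self.current.name, state)


class MasterScheduler:
    """Master scheduler: queues jobs, hands them out, and collects reports."""

    def __init__(self, workers):
        self.workers = workers
        self.queue = deque()
        self.finished = []

    def submit(self, job):
        self.queue.append(job)

    def cycle(self):
        """One control cycle: gather reports, retire finished jobs, launch queued ones."""
        for worker in self.workers:
            worker.tick()
            _, name, state = worker.report()
            if state == "done":
                self.finished.append(name)
                worker.current = None
            if worker.current is None and self.queue:
                worker.launch(self.queue.popleft())

    def run(self):
        """Repeat control cycles until no job is queued or running; return the cycle count."""
        cycles = 0
        while self.queue or any(w.current is not None for w in self.workers):
            self.cycle()
            cycles += 1
        return cycles


if __name__ == "__main__":
    master = MasterScheduler([NodeScheduler(0), NodeScheduler(1)])
    for name, steps in [("A", 2), ("B", 1), ("C", 1)]:
        master.submit(Job(name, steps))
    print(master.run(), master.finished)
```

The periodic `cycle` call stands in for the status reports the master receives; in a real system those reports would arrive asynchronously over the network.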
Some, but not all, supercomputer schedulers attempt to maintain locality of job execution. The PBS Pro scheduler used on the Cray XT3 and Cray XT4 systems does not attempt to optimize locality on its three-dimensional torus interconnect, but simply uses the first available processor.[20] On the other hand, IBM's scheduler on the Blue Gene supercomputers aims to exploit locality and minimize network contention by assigning tasks from the same application to one or more midplanes of an 8x8x8 node group.[20] The Slurm Workload Manager scheduler uses a best fit algorithm, and performs Hilbert curve scheduling to optimize locality of task assignments.[20] Several modern supercomputers such as the Tianhe-2 use Slurm, which arbitrates contention for resources across the system. Slurm is open source, Linux-based, very scalable, and can manage thousands of nodes in a computer cluster with a sustained throughput of over 100,000 jobs per hour.[21][22]
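The combination of Hilbert curve ordering and best fit can be sketched as follows. This is not Slurm's implementation: it assumes a small two-dimensional grid rather than a three-dimensional torus, and the function names `hilbert_index`, `free_runs`, and `best_fit_allocate` are hypothetical. The idea shown is that nodes are numbered along a space-filling curve, so that a run of consecutive numbers corresponds to physically nearby nodes, and a job is then placed in the smallest free run that can hold it.

```python
def hilbert_index(n, x, y):
    """Position of grid cell (x, y) along the Hilbert curve on an n-by-n grid.

    n must be a power of two. This is the standard iterative 2D mapping.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def free_runs(free_flags):
    """Return (start, length) for each maximal run of free curve positions."""
    runs, start = [], None
    for i, free in enumerate(free_flags):
        if free and start is None:
            start = i
        elif not free and start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(free_flags) - start))
    return runs


def best_fit_allocate(free_flags, size):
    """Claim `size` consecutive positions from the smallest free run that fits.

    Marks the positions busy in free_flags and returns them, or returns None
    if no run is large enough.
    """
    candidates = [run for run in free_runs(free_flags) if run[1] >= size]
    if not candidates:
        return None
    start, _ = min(candidates, key=lambda run: (run[1], run[0]))
    for i in range(start, start + size):
        free_flags[i] = False
    return list(range(start, start + size))


if __name__ == "__main__":
    n = 4
    # Node coordinates listed in Hilbert-curve order.
    curve = sorted(((x, y) for x in range(n) for y in range(n)),
                   key=lambda p: hilbert_index(n, p[0], p[1]))
    free = [True] * len(curve)
    free[3] = free[7] = False          # pretend two nodes are already busy
    positions = best_fit_allocate(free, 3)
    print([curve[i] for i in positions])
```

Because consecutive positions on the Hilbert curve are always neighbouring grid cells, a job placed in one contiguous run lands on a compact group of nodes, which is the locality property the text attributes to this style of scheduling.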