Algorithms for Modern Hardware

This is an upcoming high performance computing book titled “Algorithms for Modern Hardware” by Sergey Slotin.

Its intended audience is everyone from performance engineers and practical algorithm researchers to undergraduate computer science students who have just finished an advanced algorithms course and want to learn more practical ways to speed up a program than by going from $O(n \log n)$ to $O(n \log \log n)$.

All book materials arehosted on GitHub, with code in aseparate repository. This isn’t a collaborative project, but any contributions and feedback are very much welcome.

FAQ

Bug/typo fixes. If you spot an error on any page, please do one of these — in the order of preference:

fix it right away by either clicking on the pencil icon on the top right on any page (opens theProse editor) or, more traditionally, by modifying the page directly on GitHub (the link to the source is also on the top right);
createan issue on GitHub;
tell me about it directly;

or leave a comment on some other website where it is being discussed — I read most ofHackerNews,CodeForces, andTwitter threads where I’m tagged.

Release date. The book is split into several parts that I plan to finish sequentially with long breaks in-between. Part I, Performance Engineering, is ~75% complete as of March 2022 and will hopefully be >95% complete by this summer.

A “release” for an open-source book like this essentially means:

finishing all essential sections and filling all the TODOs,
mostly freezing the table of contents (except for the case studies),
doing one final round of heavy copyediting (hopefully, with the help of a professional editor — I still haven’t figured out how commas work in English),
drawing illustrations (I stole a lot of those that are currently displayed),
making a print-optimized PDF and figuring out the best way to distribute it.

After that, I will mostly be fixing errors and only doing some minor edits reflecting the changes in technology or new algorithm advancements. The e-book/printed editions will most likely be sold on a “pay what you want” basis, and in any case, the web version will always be fully available online.

Pre-ordering / financially supporting the book. Due to my unfortunate citizenship and place of birth, you can’t — that is, until I find a way that at the same time complies with international sanctions, doesn’t sponsorthe war, and won’t put me in prison for tax evasion.

So, don’t bother. If you want to support this book, just share it and help fix typos — that would be enough.

Translations. The website has a separate functionality for creating and managing translations — and I’ve already been contacted by some nice people willing to translate the book into Italian and Chinese (and I will personally translate at least some of it into my native Russian).

However, as the book is still evolving, it is probably not the best idea to start translating it at least until Part I is finished. That said, you are very much encouraged to make translations of any articles and publish them in your blogs — just send me the link so that we can merge it back when centralized translation starts.

“Translating” the Russian version. The articles hosted atru.algorithmica.org/cs/ are not about advanced performance engineering but mostly about classical computer science algorithms — without discussing how to speed them up beyond asymptotic complexity. Most of the information there is not unique and already exists in English on some other places on the internet: for example, the similar-spiritedcp-algorithms.com.

Teaching performance engineering in colleges. One of my goals for writing this book is to change the way computer science — algorithm design, to be more precise — is taught in colleges. Let me elaborate on that.

There are two highly impactful textbooks on which most computer science courses are built. Both are undoubtedly outstanding, butone of them is 50 years old, andthe other is 30 years old, andcomputers have changed a lot since then. Asymptotic complexity is not the sole deciding factor anymore. In modern practical algorithm design, you choose the approach that makes better use of different types of parallelism available in the hardware over the one that theoretically does fewer raw operations on galaxy-scale inputs.

And yet, the computer science curricula in most colleges completely ignore this shift. Although there are some great courses that aim to correct that — such as “Performance Engineering of Software Systems” from MIT, “Programming Parallel Computers” from Aalto University, and some non-academic ones like Denis Bakhvalov’s “Performance Ninja” — most computer science graduates still treat modern hardware like something from the 1990s.

What I really want to achieve is that performance engineering becomes taught right after introduction to algorithms. Writing the first comprehensive textbook on the subject is a large part of it, and this is why I rush to finish it by the summer so that the colleges can pick it up in the next academic year. But creating a new course requires more than that: you need a balanced curriculum, course infrastructure, lecture slides, lab assignments… so for some time after finishing the main book, I will be working on course materials and tools forteaching performance engineering — and I’m looking forward to collaborating with other people who want to make it a reality as well.

Part I: Performance Engineering

The first part covers the basics of computer architecture and optimization of single-threaded algorithms.

It walks through the main CPU optimization topics such as caching, SIMD, and pipelining, and provides brief examples in C++, followed by large case studies where we usually achieve a significant speedup over some STL algorithm or data structure.

Planned table of contents:

0. Preface1. Complexity Models 1.1. Modern Hardware 1.2. Programming Languages 1.3. Models of Computation 1.4. When to Optimize2. Computer Architecture 1.1. Instruction Set Architectures 1.2. Assembly Language 1.3. Loops and Conditionals 1.4. Functions and Recursion 1.5. Indirect Branching 1.6. Machine Code Layout 1.7. System Calls 1.8. Virtualization3. Instruction-Level Parallelism 3.1. Pipeline Hazards 3.2. The Cost of Branching 3.3. Branchless Programming 3.4. Instruction Tables 3.5. Instruction Scheduling 3.6. Throughput Computing 3.7. Theoretical Performance Limits4. Compilation 4.1. Stages of Compilation 4.2. Flags and Targets 4.3. Situational Optimizations 4.4. Contract Programming 4.5. Non-Zero-Cost Abstractions 4.6. Compile-Time Computation 4.7. Arithmetic Optimizations 4.8. What Compilers Can and Can't Do5. Profiling 5.1. Instrumentation 5.2. Statistical Profiling 5.3. Program Simulation 5.4. Machine Code Analyzers 5.5. Benchmarking 5.6. Getting Accurate Results6. Arithmetic 6.1. Floating-Point Numbers 6.2. Interval Arithmetic 6.3. Newton's Method 6.4. Fast Inverse Square Root 6.5. Integers 6.6. Integer Division 6.7. Bit Manipulation(6.8. Data Compression)7. Number Theory 7.1. Modular Inverse 7.2. Montgomery Multiplication(7.3. Finite Fields)(7.4. Error Correction) 7.5. Cryptography 7.6. Hashing 7.7. Random Number Generation8. External Memory 8.1. Memory Hierarchy 8.2. Virtual Memory 8.3. External Memory Model 8.4. External Sorting 8.5. List Ranking 8.6. Eviction Policies 8.7. Cache-Oblivious Algorithms 8.8. Spacial and Temporal Locality(8.9. B-Trees)(8.10. Sublinear Algorithms)(9.13. Memory Management)9. RAM & CPU Caches 9.1. Memory Bandwidth 9.2. Memory Latency 9.3. Cache Lines 9.4. Memory Sharing 9.5. Memory-Level Parallelism 9.6. Prefetching 9.7. Alignment and Packing 9.8. Pointer Alternatives 9.9. Cache Associativity 9.10. Memory Paging 9.11. AoS and SoA10. SIMD Parallelism 10.1. Intrinsics and Vector Types 10.2. Moving Data 10.3. Reductions 10.4. Masking and Blending 10.5. In-Register Shuffles 10.6. Auto-Vectorization and SPMD11. Algorithm Case Studies 11.1. Binary GCD(11.2. Prime Number Sieves) 11.3. Integer Factorization 11.4. Logistic Regression 11.5. Big Integers & Karatsuba Algorithm 11.6. Fast Fourier Transform 11.7. Number-Theoretic Transform 11.8. Argmin with SIMD 11.9. Prefix Sum with SIMD 11.10. Reading Decimal Integers 11.11. Writing Decimal Integers(11.12. Reading and Writing Floats)(11.13. String Searching) 11.14. Sorting 11.15. Matrix Multiplication12. Data Structure Case Studies 12.1. Binary Search 12.2. Static B-Trees(12.3. Search Trees) 12.4. Segment Trees(12.5. Tries)(12.6. Range Minimum Query) 12.7. Hash Tables(12.8. Bitmaps)(12.9. Probabilistic Filters)

Among the cool things that we will speed up:

2x faster GCD (compared tostd::gcd)
8-15x faster binary search (compared tostd::lower_bound)
5-10x faster segment trees (compared to Fenwick trees)
5x faster hash tables (compared tostd::unordered_map)
2x faster popcount (compared to repeatedly callingpopcnt)
35x faster parsing series of integers (compared toscanf)
?x faster sorting (compared tostd::sort)
2x faster sum (compared tostd::accumulate)
2-3x faster prefix sum (compared to naive implementation)
10x faster argmin (compared to naive implementation)
10x faster array searching (compared tostd::find)
15x faster search tree (compared tostd::set)
100x faster matrix multiplication (compared to “for-for-for”)
optimal word-size integer factorization (~0.4ms per 60-bit integer)
optimal Karatsuba Algorithm
optimal FFT

Volume: 450-600 pages
Release date: Q3 2022

Part II: Parallel Algorithms

Concurrency, models of parallelism, context switching, green threads, concurrent runtimes, cache coherence, synchronization primitives, OpenMP, reductions, scans, list ranking, graph algorithms, lock-free data structures, heterogeneous computing, CUDA, kernels, warps, blocks, matrix multiplication, sorting.

Volume: 150-200 pages
Release date: 2023-2024?

Part III: Distributed Computing

Metworking, message passing, actor model, communication-constrained algorithms, distributed primitives, all-reduce, MapReduce, stream processing, query planning, storage, sharding, compression, distributed databases, consistency, reliability, scheduling, workflow engines, cloud computing.

Release date: ??? (more likely to be completed than not)

Part IV: Software & Hardware

LLVM IR, compiler optimizations & back-end, interpreters, JIT-compilation, Cython, JAX, Numba, Julia, OpenCL, DPC++, oneAPI, XLA, (basic) Verilog, FPGAs, ASICs, TPUs and other AI accelerators.

Release date: ??? (less likely to be completed than not)

Acknowledgements

The book is largely based on blog posts, research papers, conference talks, and other work authored by a lot of people:

Disclaimer: Technology Choices

The examples in this book use C++, GCC, x86-64, CUDA, and Spark, although the underlying principles conveyed are not specific to them.

To clear my conscience, I’m not happy with any of these choices: these technologies just happen to be the most widespread and stable at the moment and thus more helpful to the reader. I would have respectively picked C / Rust /Carbon?, LLVM, arm, OpenCL, and Dask; maybe there will be a 2nd edition in which some of the tech stack is changed.