Understanding Caching

by James Bottomley
on January 1, 2004

Since the earliest days of microprocessors, system designers have been plagued by a problem in which the speed of the CPU's operation exceeded the bandwidth of the memory subsystem to which it was connected. To avoid wasting CPU cycles while waiting for the memory to fetch the requested data, the universally adopted solution was to use an area of faster (and thus more expensive) memory to cache main memory data. This solution allowed the CPU to operate at its natural speed as long as the data it required was available in the cache.

The purpose of this article is to explain caching from the point of view of a kernel programmer. I also explain some of the common terms used to describe caches. This article is divided into sections whose kernel programming relevance is indicated; that is, some sections explain that cache properties are irrelevant to understanding the essentials of how the kernel handles caching. If you're coming from an Intel IA32 background, caching is practically transparent to you. In order to write kernel code that operates correctly on all the architectures Linux supports, however, you need to know the essentials of how caching works in general.

A Cache and Its Properties

Simply put, a cache is a place that buffers memory accesses and may have a copy of the data you are requesting. Usually one thinks of caches (there may be more than one) as being stacked; the CPU is at the top, followed by layers of one or more caches and then the main memory. In this hierarchy, caches are quantified by their level. The cache closest to the CPU is called level one, L1 for short, and caches increase in level until the main memory is reached.

A cache line is the smallest unit of memory that can be transferred to or from a cache. The essential elements that quantify a cache are called the read and write line widths. These signify the minimum amount of data the cache must read or write from the memory or cache below it. Frequently, these quantities are the same, so caches often are quantified simply by the line width. Even if they differ, the longest width usually is called the line width.

The next property that quantifies a cache is its size. This number is an indication of how much data could be stored in the cache. Often, the performance rule of thumb is the bigger the cache, the better the benchmarks.

A multilevel cache can be either inclusive or exclusive. Exclusive means a particular cache line may be present in exactly one of the cache levels and no more than one. Inclusive means the line may be present simultaneously in more than one level of cache. Nothing prevents the line widths from being different in differing cache levels.

Finally, a particular cache can be either write through or write back. Write through means the cache may store a copy of the data, but the write must be completed at the next level down before it can be signaled as complete to the layer above. Write back means a write may be considered complete as soon as the data is stored in the cache. For a write back cache, as long as the written data has not yet been transmitted, the cache line is considered dirty, because it ultimately must be written out to the level below.

Cache Management and Coherency

One of the most basic problems with caches is coherency. A cache line is termed coherent when the data in the line is identical to the data stored in the main memory being cached. If this is not true, the cache line is termed incoherent. Lack of coherency can cause two particular problems. The first problem, which may occur for all caches, is stale data. In this situation, data has changed in main memory but the cache hasn't been updated to reflect the change. This usually manifests itself as an incorrect read, as illustrated in Figure 1. This is a transient error, because the correct data is sitting in main memory; the cache simply needs to be told to bring it in.

Figure 1. Stale Data Problem

The second problem, which occurs only with write back caches, can cause actual destruction of data and is much more insidious. As illustrated in Figure 2, the data has been changed in memory, and it also has been changed separately by a CPU write to the cache. Because the cache must write out one line at a time, there now is no way to reconcile the changes—either the cache line must be purged without being written, losing the CPU's change, or the line must be written out, thus losing the changes made to main memory. All programmers must avoid reaching the point where data destruction becomes inevitable; they can do this through the judicious use of the various cache management APIs.

Figure 2. Data Destruction by Dirty Cache Lines

Cache-Line Interference

The situation where two sets of independent data lie in the same cache line, potentially leading to the data destruction detailed above, is termed cache-line interference. If you are laying out data structures in memory, the general rule to avoid this situation is never, ever have data that can be modified outside of the caches mixed with data the CPU may ordinarily use. If you absolutely have to violate this rule, make sure all externally modifiable elements of the structure are aligned to L1_CACHE_BYTES, a value set at compile time to the largest possible cache width value for all the processors on which your code may run. The best thing to do is use kmalloc to allocate two separate regions. kmalloc never allocates two regions that overlap in a cache line.
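
As a minimal sketch of this rule, assuming a hypothetical driver structure allocated with kmalloc, the fields a device may modify by DMA can be grouped and aligned so they never share a cache line with the CPU-only fields (all names here are illustrative, not taken from any real driver):

#include <linux/cache.h>        /* L1_CACHE_BYTES */

struct my_dev_state {           /* hypothetical structure */
        /* CPU-only bookkeeping; the device never touches these fields. */
        unsigned long tx_packets;
        unsigned long rx_packets;

        /* Externally modified (DMA) data, isolated in its own cache line(s). */
        struct {
                unsigned int status;
                unsigned int length;
        } hw __attribute__((aligned(L1_CACHE_BYTES)));
};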

Cache Management Instructions

The most basic instruction, called an invalidate, simply ejects the nominated line from all caches. Any reference to data in that line causes it to be re-fetched from main memory. Thus, the stale data problem may be resolved by invalidating the cache line before reading the data. In Linux, such an invalidation is done with:

void dma_cache_inv(unsigned long address,
                   unsigned long size);

where address is the virtual address on which to begin, and size is the length of data to invalidate. Note that size is rounded up automatically to a multiple of the cache line width.
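
As a hedged example of curing stale data, assume a device has just finished a DMA write into a driver buffer; rx_buf, RX_BUF_SIZE and status are hypothetical names:

/* The device has just DMAed fresh data into rx_buf behind the caches. */
dma_cache_inv((unsigned long) rx_buf, RX_BUF_SIZE);

/* These reads now come from main memory rather than stale cache lines. */
status = rx_buf[0];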

For write back caches, any dirty cache line may be written out, or flushed, to main memory using:

void dma_cache_wback(unsigned long address,
                     unsigned long size);

This flushing must be done before anything changes the main memory under the dirty cache line. You therefore must issue the flush before an external entity (such as a PCI card) modifies main memory and issue an invalidate after this flush but before the CPU accesses any data that has changed.
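
A sketch of that ordering, for a buffer the CPU prepares and a device then updates in place, might look like the following; the helper functions are hypothetical, and only the dma_cache_* calls are the API described here:

fill_buffer(buf, len);                      /* CPU writes; the lines are dirty */
dma_cache_wback((unsigned long) buf, len);  /* flush before the device writes  */

start_device_dma(dev, buf, len);            /* external entity modifies memory */
wait_for_device_dma(dev);

dma_cache_inv((unsigned long) buf, len);    /* discard the now-stale copies    */
examine_buffer(buf, len);                   /* CPU reads the device's data     */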

In theory, for write back caches an invalidate kills the cache line without actually writing the data out, thus destroying the data in the cache. A safer thing to do in this case is issue a flush and invalidate instruction:

void dma_cache_wback_inv(unsigned long address,
                         unsigned long size);

This flushes the cache lines to main memory and then invalidates them from the cache.

Types of Caches

This section explains how a cache actually works. The only vital piece of information you need from this section is a property called aliasing, which means that the same physical address in memory may be cached in multiple distinct cache lines. How the kernel actually manages the aliases is discussed in the following section.

In a directly mapped cache, as shown in Figure 3, the cache is divided up into cache lines of known width (four in the example). Each line in the cache is characterized by a unique index, so every byte in the cache is addressed by the index of the line and the offset into the line. Each index of the cache also possesses a hidden number called the tag.

Figure 3. A Directly Mapped Cache

Every address in the system is divided into three pieces—tag, index and offset—along a power of two boundary (Figure 4). When a line is brought into the cache, the tag and index are extracted from the address. The line is stored in the cache at the required index, and the hidden tag is stored along with the line data. When the CPU makes reference to a particular address, the cache is consulted at the given index. If the tags match, the offset into the line is fetched to satisfy the address reference. If the tags do not match, the current line may be flushed back to main memory and the correct line brought into the cache.
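
As an illustration only, here is how the split might look for an assumed directly mapped cache of 16KB with 32-byte lines; the geometry is invented, and real hardware does this in circuitry rather than in C:

#define LINE_WIDTH   32                          /* bytes per cache line */
#define CACHE_SIZE   (16 * 1024)                 /* total cache size     */
#define NUM_LINES    (CACHE_SIZE / LINE_WIDTH)   /* 512 indices          */

#define OFFSET(addr) ((addr) & (LINE_WIDTH - 1))
#define INDEX(addr)  (((addr) / LINE_WIDTH) % NUM_LINES)
#define TAG(addr)    ((addr) / CACHE_SIZE)

Two addresses that differ by an exact multiple of CACHE_SIZE yield the same index but different tags, which is where the problem described next comes from.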

Every cacheable address has one and only one corresponding index line, which can cause problems. For instance, if the processor reads a sequence of addresses that accidentally happen to correspond to the same cache index, the cache line must be evicted and re-fetched on each read. Such a situation easily can happen in, say, a for loop reading elements of a structure that happens to be about the same size as the cache. For directly mapped caches, the index sometimes is called the cache color, and this problem is called the cache-line coloring problem.
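
Using the invented geometry from the previous sketch, the following hypothetical loop shows how this happens: each element begins a multiple of CACHE_SIZE beyond the previous one, so every read maps to the same index and evicts the line the prior iteration just fetched:

struct record {                          /* pathologically sized structure */
        char payload[CACHE_SIZE];
};

unsigned long sum_first_bytes(struct record *r, int n)
{
        unsigned long sum = 0;
        int i;

        for (i = 0; i < n; i++)
                sum += r[i].payload[0];  /* same cache index every iteration */
        return sum;
}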

To get around the coloring problem of directly mapped caches, cache circuitry sometimes is arranged so that a cache lookup can compare tags simultaneously in more than one cache line. In an N-way associative cache, each index corresponds to N cache lines (and tags); thus, we can have up to N addresses with the same index simultaneously in the cache. The added parallel cache lookup circuitry tends to increase the cost of associativity somewhat, so it usually is found only in higher-end CPUs.

At the very top of the range, you might find a fully associative cache. This type of cache has no index at all, and all lines are consulted at once when looking for a particular tag.

All modern CPUs handle address translation, which means the virtual address used by the kernel or application to refer to memory isn't the same as the physical address where the data actually resides. The caches can be placed before or after address translation, and sometimes in a hierarchical cache there is a mixture of placements. Each of these placements has different properties and features, as described below.

In physically indexed, physically tagged (PIPT) caches, the tag and index of the cache are both in physical memory, that is, after virtual address translation has been done. This process is nice and simple, but the disadvantage of PIPT caches is that a valid address translation must be in the TLB (translation lookaside buffer) of the CPU. If such a TLB entry needs to be fetched from memory before the address translation can be done, the advantage of caching the data is lost. Even if a TLB entry is present, the TLB lookup and the cache lookup must be done sequentially, making these caches slow.

In virtually indexed, virtually tagged (VIVT) caches, on the other hand, both the index and tag are in the virtual address space in which the CPU currently is operating. Although this makes cache lookups much faster (no address translation needs to be done before the lookup or after, if the cache lookup is successful), they suffer from several other problems:

  1. Virtual address translations usually are changed as part of normal kernel operation, so the cache must pay careful attention to changes in TLB entries (and changes in address space) and flush cache lines whose translations have changed.

  2. Even in a single address space, multiple virtual addresses may exist for the same physical address. Each of these virtual addresses would be cached separately, even though they represent the same data. This is called the cache-line aliasing problem.

Generally, though, the added lookup speed of a VIVT cache outweighs its disadvantages, making it the predominant cache type for non-x86 CPUs.

A type of hybrid cache designed to overcome some of the shortcomings of the VIVT cache without sacrificing too much of its speed advantage is virtually indexed, physically tagged (VIPT) caching. The index is on the virtual address, but the tag is on the physical address, so the combination (tag, offset) must specify the full physical address. This requirement causes the tags to be larger than the tags for other cache types.

The VIPT cache gains its speed advantage over PIPT because the address translation and the cache lookup now can be done in parallel. The CPU doesn't know if the cache line is valid (the tags match), however, until the address translation has completed.

The disadvantages of VIVT are overcome because the tag is physical: the VIPT cache automatically detects aliasing when it sees that two tags are identical in the cache. Thus, a VIPT cache may be constructed in such a fashion that cache-line aliasing never occurs.

The fourth theoretical type of cache, physically indexed, virtually tagged (PIVT), is basically useless and is not discussed further.

The Aliasing Problem

Any time the kernel sets up more than one virtual mapping for the same physical page, cache-line aliasing may occur. The kernel is careful to avoid aliasing, so it usually occurs only in one particular instance: when the user mmaps a file. Here, the kernel has one virtual address for pages of the file in the page cache, and the user may have one or more different virtual addresses. This is possible because nothing prevents the user from mmaping the file at multiple locations.

When a file is mmaped, the kernel adds the mapping to one of the inode's lists: i_mmap, for maps that cannot change the underlying data, or i_mmap_shared, for maps that can change the file's data. The API for bringing the cache aliases of a page into sync is:

void flush_dcache_page(struct page *page);

This API must be called every time data on the page is altered by the kernel, and it should be called before reading data from the page if page->mapping->i_mmap_shared is not empty. In architecture-specific code, flush_dcache_page loops over the i_mmap_shared list and flushes the cache data. It then loops over the i_mmap list and invalidates it, thus bringing all the aliases into sync.
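
A hedged fragment showing the first case, assuming the code already holds a page-cache page in page, and where src and len are hypothetical:

void *kaddr = kmap(page);        /* the kernel's own mapping of the page  */
memcpy(kaddr, src, len);         /* the kernel alters the page's data     */
kunmap(page);

flush_dcache_page(page);         /* bring any user mmap aliases into sync */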

Separate Instruction and Data Caches

In their quest for efficiency, processors often have separate caches for the instructions they execute and the data on which they operate. Often, these caches are separate mechanisms, and a data write may not be seen by the instruction cache. This causes problems if you are trying to execute instructions you just wrote into memory, for example, during module loading or when using a trampoline. You must use the following API:

void flush_icache_range(unsigned long start,
                        unsigned long end);

to ensure that the instructions are seen by the instruction cache prior to execution. start and end are the starting and ending addresses, respectively, of the block of memory you modified to contain your instructions.
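
A sketch of the trampoline case, with hypothetical names for the buffer and the code being copied into it:

memcpy(code_buf, trampoline_code, trampoline_len);  /* write the instructions */

flush_icache_range((unsigned long) code_buf,
                   (unsigned long) code_buf + trampoline_len);

((void (*)(void)) code_buf)();  /* the CPU now fetches the new instructions */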

General Cache Flushing

Two APIs globally flush the CPU caches:

void flush_cache_all(void);

and

void flush_cache_mm(struct mm_struct *mm);

These flush, respectively, all the caches in the system and only the cache lines belonging to the particular process address space mm. Both of these are extremely expensive operations and should be used only when absolutely necessary.

Caching in SMP Environments

When more than one CPU is in the system, a level of caching usually exists that is unique to each CPU. Depending on the architecture, it may be the responsibility of the kernel to ensure that changes in the cache of one CPU become visible to the other CPUs. Fortunately, most CPUs handle this type of coherency problem in hardware. Even if they don't, as long as you follow the APIs listed in this article, you can maintain coherency across all the CPUs.

Conclusion

I hope I've given you a brief overview of how caches work and how the kernel manages them. The contents of this article should be sufficient for you to understand caching in most kernel programming situations you're likely to encounter. Be aware, however, that if you get deeply into the guts of kernel cache management (particularly in the architecture-specific code), you likely will come across concepts and APIs not discussed here.

James Bottomley is the software architect for SteelEye. He maintains the SCSI subsystem, the Linux Voyager port and the 53c700 driver. He also has made contributions to PA-RISC Linux development in the area of DMA/device model abstraction.
