Clock sources, Clock events, sched_clock() and delay timers

This document tries to briefly explain some basic kernel timekeeping abstractions. It partly pertains to the drivers usually found in drivers/clocksource in the kernel tree, but the code may be spread out across the kernel.

If you grep through the kernel source you will find a number of architecture-specific implementations of clock sources, clock events and several likewise architecture-specific overrides of the sched_clock() function and some delay timers.

To provide timekeeping for your platform, the clock source provides the basic timeline, whereas clock events shoot interrupts on certain points on this timeline, providing facilities such as high-resolution timers. sched_clock() is used for scheduling and timestamping, and delay timers provide an accurate delay source using hardware counters.

Clock sources

The purpose of the clock source is to provide a timeline for the system that tells you where you are in time. For example issuing the command ‘date’ on a Linux system will eventually read the clock source to determine exactly what time it is.

Typically the clock source is a monotonic, atomic counter which will provide n bits which count from 0 to (2^n)-1 and then wrap around to 0 and start over. It will ideally NEVER stop ticking as long as the system is running. It may stop during system suspend.

The clock source shall have as high resolution as possible, and the frequency shall be as stable and correct as possible as compared to a real-world wall clock. It should not move unpredictably back and forth in time or miss a few cycles here and there.

It must be immune to the kind of effects that occur in hardware where e.g. the counter register is read in two phases on the bus, lowest 16 bits first and the higher 16 bits in a second bus cycle, with the counter bits potentially being updated in between, leading to the risk of very strange values from the counter.

When the wall-clock accuracy of the clock source isn’t satisfactory, there are various quirks and layers in the timekeeping code for e.g. synchronizing the user-visible time to RTC clocks in the system or against networked time servers using NTP, but all they do basically is update an offset against the clock source, which provides the fundamental timeline for the system. These measures do not affect the clock source per se, they only adapt the system to its shortcomings.

The clock source struct shall provide means to translate the provided counter into a nanosecond value as an unsigned long long (unsigned 64 bit) number. Since this operation may be invoked very often, doing this in a strict mathematical sense is not desirable: instead the number is taken as close as possible to a nanosecond value using only the arithmetic operations multiply and shift, so in clocksource_cyc2ns() you find:

ns ~= (clocksource * mult) >> shift
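The formula above can be tried out in plain user-space C. This is a minimal sketch, not the kernel's own code: cyc2ns() is a hypothetical stand-in for clocksource_cyc2ns(), and the mult/shift pairs in the usage note are hand-picked example values.

```c
#include <stdint.h>

/* Convert a raw cycle count to nanoseconds using one multiply and
 * one shift. The caller must keep cycles * mult below 2^64, which
 * bounds how many cycles may be converted in one go. */
static inline uint64_t cyc2ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
    return (cycles * mult) >> shift;
}
```

For a 1 MHz counter (1000 ns per cycle) one valid pair is shift = 10 and mult = 1000 << 10 = 1024000, so cyc2ns(5, 1024000, 10) gives 5000. A larger shift gives a more precise ratio for awkward frequencies, at the price of overflowing earlier.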

You will find a number of helper functions in the clock source code intended to aid in providing these mult and shift values, such as clocksource_khz2mult(), clocksource_hz2mult() that help determine the mult factor from a fixed shift, and clocksource_register_hz() and clocksource_register_khz() which will help out assigning both shift and mult factors using the frequency of the clock source as the only input.
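To see where a mult value comes from, the following sketch derives it from a frequency in Hz and a fixed shift, in the spirit of clocksource_hz2mult(). The name hz_to_mult() is hypothetical and the body is an assumption about the arithmetic involved, not a copy of the kernel helper.

```c
#include <stdint.h>

#define SKETCH_NSEC_PER_SEC 1000000000ULL

/* mult = (10^9 << shift) / hz, rounded to nearest.
 * The caller must pick a shift small enough that the result fits
 * in 32 bits and 10^9 << shift does not overflow 64 bits. */
static inline uint32_t hz_to_mult(uint32_t hz, uint32_t shift)
{
    uint64_t tmp = SKETCH_NSEC_PER_SEC << shift;

    tmp += hz / 2;              /* round to nearest instead of down */
    return (uint32_t)(tmp / hz);
}
```

With hz = 1000000 and shift = 10 this yields 1024000, i.e. exactly 1000 ns per cycle after the shift.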

For really simple clock sources accessed from a single I/O memory location there is nowadays even clocksource_mmio_init() which will take a memory location, bit width, a parameter telling whether the counter in the register counts up or down, and the timer clock rate, and then conjure all necessary parameters.

Since a 32-bit counter at say 100 MHz will wrap around to zero after some 43 seconds, the code handling the clock source will have to compensate for this. That is the reason why the clock source struct also contains a ‘mask’ member telling how many bits of the source are valid. This way the timekeeping code knows when the counter will wrap around and can insert the necessary compensation code on both sides of the wrap point so that the system timeline remains monotonic.
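The role of the mask is easiest to see in the delta computation between two counter reads. The sketch below is illustrative only: COUNTER_MASK(), cycles_delta() and wrap_period_sec() are hypothetical names chosen for this example.

```c
#include <stdint.h>

/* All-ones mask covering the valid low 'bits' of the counter. */
#define COUNTER_MASK(bits) \
    ((bits) < 64 ? ((1ULL << (bits)) - 1) : ~0ULL)

/* Cycles elapsed between two reads. The unsigned subtraction plus
 * the mask gives the right answer across one wrap, provided the
 * two reads are less than one full counter period apart. */
static inline uint64_t cycles_delta(uint64_t now, uint64_t last, uint64_t mask)
{
    return (now - last) & mask;
}

/* Whole seconds before a counter of the given width wraps. */
static inline uint64_t wrap_period_sec(unsigned int bits, uint64_t hz)
{
    return COUNTER_MASK(bits) / hz;
}
```

For a 32-bit counter, a read of 0x00000010 following a read of 0xFFFFFFF0 yields a delta of 0x20 cycles rather than a huge negative jump, and wrap_period_sec(32, 100000000) gives 42, matching the "some 43 seconds" above.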

Clock events

Clock events are the conceptual reverse of clock sources: they take a desired time specification value and calculate the values to poke into hardware timer registers.

Clock events are orthogonal to clock sources. The same hardware and register range may be used for the clock event, but it is essentially a different thing. The hardware driving clock events has to be able to fire interrupts, so as to trigger events on the system timeline. On an SMP system, it is ideal (and customary) to have one such event driving timer per CPU core, so that each core can trigger events independently of any other core.

You will notice that the clock event device code is based on the same basic idea about translating counters to nanoseconds using mult and shift arithmetic, and you find the same family of helper functions again for assigning these values. The clock event driver does not need a ‘mask’ attribute however: the system will not try to plan events beyond the time horizon of the clock event.
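For clock events the conversion runs in the opposite direction: a nanosecond delta is turned into a number of timer cycles to program. The sketch below shows that direction, with the delta clamped to what the device can represent. struct evt_dev is a hypothetical stand-in that only borrows a few field names from the kernel's clock event device, and ns_to_event_cycles() is an invented name.

```c
#include <stdint.h>

/* Hypothetical, trimmed-down view of a clock event device. Here
 * mult/shift encode cycles per nanosecond, i.e. the inverse of the
 * ratio a clock source uses. */
struct evt_dev {
    uint32_t mult;
    uint32_t shift;
    uint64_t min_delta_ns;  /* shortest programmable interval */
    uint64_t max_delta_ns;  /* time horizon of the hardware   */
};

/* Translate a requested delay in ns into timer cycles, never
 * planning beyond the device's horizon or below its minimum. */
static inline uint64_t ns_to_event_cycles(const struct evt_dev *dev,
                                          uint64_t delta_ns)
{
    if (delta_ns < dev->min_delta_ns)
        delta_ns = dev->min_delta_ns;
    if (delta_ns > dev->max_delta_ns)
        delta_ns = dev->max_delta_ns;

    return (delta_ns * dev->mult) >> dev->shift;
}
```

For a 500 MHz timer (0.5 cycles per ns) one exact pair is shift = 16, mult = 32768, so a 2000 ns request becomes 1000 cycles. Because requests are clamped to max_delta_ns, a wrap mask is not needed on this side.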

sched_clock()

In addition to the clock sources and clock events there is a special weak function in the kernel called sched_clock(). This function shall return the number of nanoseconds since the system was started. An architecture may or may not provide an implementation of sched_clock() on its own. If a local implementation is not provided, the system jiffy counter will be used as sched_clock().

As the name suggests, sched_clock() is used for scheduling the system, determining the absolute timeslice for a certain process in the CFS scheduler for example. It is also used for printk timestamps when you have selected to include time information in printk for things like bootcharts.

Compared to clock sources, sched_clock() has to be very fast: it is called much more often, especially by the scheduler. If you have to trade off accuracy against speed, you may sacrifice accuracy for speed in sched_clock(). It does however require some of the same basic characteristics as the clock source, i.e. it should be monotonic.

The sched_clock() function may wrap only on unsigned long long boundaries, i.e. after 64 bits. Since this is a nanosecond value this will mean it wraps after circa 585 years. (For most practical systems this means “never”.)
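The figure can be checked with a few lines of arithmetic. This sketch assumes years of exactly 365 days; ns_wrap_years() is a hypothetical helper name.

```c
#include <stdint.h>

/* Whole 365-day years that fit into a 64-bit nanosecond counter. */
static inline uint64_t ns_wrap_years(void)
{
    const uint64_t ns_per_year = 365ULL * 24 * 3600 * 1000000000ULL;

    return UINT64_MAX / ns_per_year;
}
```

The integer division gives 584 full years; the exact quotient is a little under 585, which is where the "circa 585 years" comes from.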

If an architecture does not provide its own implementation of this function, it will fall back to using jiffies, limiting its resolution to one jiffy, i.e. 1/HZ seconds for the HZ value configured for the architecture. This will affect scheduling accuracy and will likely show up in system benchmarks.
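To get a feeling for how coarse the jiffies fallback is, the sketch below converts a jiffy count to nanoseconds the way a jiffies-based sched_clock() would have to. Both function names and the parameters are hypothetical; the real kernel uses its own jiffies variable and HZ constant.

```c
#include <stdint.h>

#define SKETCH_NSEC_PER_SEC 1000000000ULL

/* Smallest time step a jiffies-based clock can express, in ns. */
static inline uint64_t jiffy_resolution_ns(uint64_t hz)
{
    return SKETCH_NSEC_PER_SEC / hz;
}

/* Nanoseconds since boot as seen through the jiffy counter:
 * the value only ever advances in steps of one jiffy. */
static inline uint64_t jiffies_to_sched_ns(uint64_t jiffies_since_boot,
                                           uint64_t hz)
{
    return jiffies_since_boot * jiffy_resolution_ns(hz);
}
```

With HZ = 100 every step is 10,000,000 ns (10 ms), so two events 3 ms apart will frequently get the same timestamp.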

The clock driving sched_clock() may stop or reset to zero during system suspend/sleep. This does not matter to the function it serves of scheduling events on the system. However it may result in interesting timestamps in printk().

The sched_clock() function should be callable in any context, be IRQ- and NMI-safe, and return a sane value in any context.

Some architectures may have a limited set of time sources and lack a nice counter to derive a 64-bit nanosecond value, so for example on the ARM architecture, special helper functions have been created to provide a sched_clock() nanosecond base from a 16- or 32-bit counter. Sometimes the same counter that is also used as clock source is used for this purpose.
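One way to stretch a narrow counter into a 64-bit nanosecond value is to keep an epoch (the last raw reading and the nanoseconds accumulated up to it) and refresh it at least once per counter period. The following is a user-space sketch of that general idea only; struct ns_epoch and both functions are hypothetical and do not reproduce the ARM helpers.

```c
#include <stdint.h>

/* Hypothetical state for extending an n-bit counter to 64-bit ns. */
struct ns_epoch {
    uint64_t epoch_ns;   /* ns accumulated up to epoch_cyc       */
    uint64_t epoch_cyc;  /* raw counter value at the last update */
    uint64_t mask;       /* valid bits of the raw counter        */
    uint32_t mult;       /* cycles -> ns multiplier              */
    uint32_t shift;      /* cycles -> ns shift                   */
};

/* Nanoseconds at raw counter value 'cyc'. Valid as long as less
 * than one counter period has passed since the last update. */
static inline uint64_t epoch_read_ns(const struct ns_epoch *e, uint64_t cyc)
{
    uint64_t delta = (cyc - e->epoch_cyc) & e->mask;

    return e->epoch_ns + ((delta * e->mult) >> e->shift);
}

/* Move the epoch forward; must run at least once per wrap,
 * for instance from a periodic timer. */
static inline void epoch_update(struct ns_epoch *e, uint64_t cyc)
{
    e->epoch_ns = epoch_read_ns(e, cyc);
    e->epoch_cyc = cyc & e->mask;
}
```

With a 16-bit counter at 1 MHz (mask 0xFFFF, mult 1024000, shift 10), updating at 0xFFF0 and then reading at 0x0010 after the wrap gives 65,552,000 ns, so the nanosecond value keeps growing although the raw counter went backwards.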

On SMP systems, it is crucial for performance that sched_clock() can be called independently on each CPU without any synchronization performance hits. Some hardware (such as the x86 TSC) will cause the sched_clock() function to drift between the CPUs on the system. The kernel can work around this by enabling the CONFIG_HAVE_UNSTABLE_SCHED_CLOCK option. This is another aspect that makes sched_clock() different from the ordinary clock source.

Delay timers (some architectures only)

On systems with variable CPU frequency, the various kernel delay() functions will sometimes behave strangely. Basically these delays usually use a hard loop to delay a certain number of jiffy fractions using a “lpj” (loops per jiffy) value, calibrated on boot.

Let’s hope that your system is running on maximum frequency when this value is calibrated: as a consequence, when the frequency is geared down to half the full frequency, any delay() will be twice as long. Usually this does not hurt, as you’re commonly requesting that amount of delay or more. But basically the semantics are quite unpredictable on such systems.
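The effect can be put into numbers with a small model of a loop-based delay. This is a simplified sketch under the assumption that one loop iteration takes a fixed number of CPU cycles; the function names are hypothetical and real delay loops involve more careful scaling.

```c
#include <stdint.h>

/* Loop iterations needed for 'usecs' microseconds, given the
 * loops-per-jiffy value calibrated at boot and the HZ setting. */
static inline uint64_t udelay_loops(uint64_t usecs, uint64_t lpj, uint64_t hz)
{
    return usecs * lpj * hz / 1000000ULL;
}

/* Time those iterations really take if the CPU currently executes
 * 'loops_per_sec_now' iterations per second. */
static inline uint64_t actual_delay_us(uint64_t loops,
                                       uint64_t loops_per_sec_now)
{
    return loops * 1000000ULL / loops_per_sec_now;
}
```

With lpj = 50000 and HZ = 100 the CPU ran 5,000,000 loops per second at calibration, so a 100 us delay is 500 loops. At full speed those take 100 us; at half the frequency (2,500,000 loops per second) the same 500 loops take 200 us.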

Enter timer-based delays. Using these, a timer read may be used instead of a hard-coded loop for providing the desired delay.

This is done by declaring a struct delay_timer and assigning the appropriate function pointers and rate settings for this delay timer.
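What a timer-based delay boils down to can be sketched in user space. Note that struct mock_delay_timer below is a hypothetical mock carrying a read callback and a rate, it is not the kernel's struct delay_timer, and the fake timer that ticks once per read exists only so that the example can run without hardware.

```c
#include <stdint.h>

/* Hypothetical mock: a read callback plus the timer rate in Hz. */
struct mock_delay_timer {
    unsigned long (*read_current_timer)(void);
    unsigned long freq;
};

/* Simulated free-running timer that advances one tick per read. */
static unsigned long fake_ticks;

static inline unsigned long fake_read(void)
{
    return fake_ticks++;
}

/* Busy-wait until at least 'usecs' worth of timer ticks have
 * passed. The delay depends only on the timer rate, not on how
 * fast the CPU spins, so frequency scaling does not distort it. */
static inline void timer_udelay(const struct mock_delay_timer *t,
                                unsigned long usecs)
{
    unsigned long start = t->read_current_timer();
    unsigned long cycles =
        (unsigned long)((uint64_t)usecs * t->freq / 1000000ULL);

    /* Unsigned subtraction keeps this correct across a timer wrap. */
    while ((t->read_current_timer() - start) < cycles)
        ;
}
```

With a 1 MHz timer a 50 us delay waits for 50 ticks, however quickly or slowly the loop itself executes.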

This is available on some architectures like OpenRISC or ARM.