CPU Idle Time Management

Copyright:

© 2019 Intel Corporation

Author:

Rafael J. Wysocki <rafael.j.wysocki@intel.com>

CPU Idle Time Management Subsystem

Every time one of the logical CPUs in the system (the entities that appear to fetch and execute instructions: hardware threads, if present, or processor cores) is idle after an interrupt or equivalent wakeup event, which means that there are no tasks to run on it except for the special “idle” task associated with it, there is an opportunity to save energy for the processor that it belongs to. That can be done by making the idle logical CPU stop fetching instructions from memory and putting some of the processor’s functional units depended on by it into an idle state in which they will draw less power.

However, there may be multiple different idle states that can be used in such a situation in principle, so it may be necessary to find the most suitable one (from the kernel perspective) and ask the processor to use (or “enter”) that particular idle state. That is the role of the CPU idle time management subsystem in the kernel, called CPUIdle.

The design of CPUIdle is modular and based on the code duplication avoidance principle, so the generic code that in principle need not depend on the hardware or platform design details in it is separate from the code that interacts with the hardware. It generally is divided into three categories of functional units: governors responsible for selecting idle states to ask the processor to enter, drivers that pass the governors’ decisions on to the hardware and the core providing a common framework for them.

CPU Idle Time Governors

A CPU idle time (CPUIdle) governor is a bundle of policy code invoked when one of the logical CPUs in the system turns out to be idle. Its role is to select an idle state to ask the processor to enter in order to save some energy.

CPUIdle governors are generic and each of them can be used on any hardware platform that the Linux kernel can run on. For this reason, data structures operated on by them cannot depend on any hardware architecture or platform design details either.

The governor itself is represented by a struct cpuidle_governor object containing four callback pointers, enable, disable, select, reflect, a rating field described below, and a name (string) used for identifying it.

For the governor to be available at all, that object needs to be registered with the CPUIdle core by calling cpuidle_register_governor() with a pointer to it passed as the argument. If successful, that causes the core to add the governor to the global list of available governors. The new governor will then be used from that point on if it is the only one in the list (that is, the list was empty before), or the value of its rating field is greater than the value of that field for the governor currently in use, or the name of the new governor was passed to the kernel as the value of the cpuidle.governor= command line parameter (there can be only one CPUIdle governor in use at a time). Also, user space can choose the CPUIdle governor to use at run time via sysfs.

Once registered, CPUIdle governors cannot be unregistered, so it is not practical to put them into loadable kernel modules.
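The rule for deciding whether a newly registered governor takes over can be sketched in user space as follows. This is only a model of the conditions listed above, not kernel code: struct model_governor, model_register_governor() and the param_governor argument (standing in for the cpuidle.governor= value) are hypothetical names, and locking, duplicate detection and the list itself are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the governor object; not the kernel's definition. */
struct model_governor {
    const char *name;
    unsigned int rating;
};

/* Governor currently in use in this model (NULL while the list is empty). */
static const struct model_governor *model_curr_governor;

/*
 * Model of the registration rule: the new governor is used from now on if
 * no governor was registered before, if its rating is greater than that of
 * the governor in use, or if its name matches param_governor (a stand-in
 * for the cpuidle.governor= value; NULL if the parameter was not given).
 * Returns 1 if the new governor is now in use and 0 otherwise.
 */
static int model_register_governor(const struct model_governor *gov,
                                   const char *param_governor)
{
    assert(gov != NULL && gov->name != NULL);

    if (model_curr_governor == NULL ||
        gov->rating > model_curr_governor->rating ||
        (param_governor != NULL && strcmp(param_governor, gov->name) == 0)) {
        model_curr_governor = gov;
        return 1;
    }
    return 0;
}
```

In this model, a governor with a lower rating replaces the one in use only when its name matches the command line value.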

The interface between CPUIdle governors and the core consists of four callbacks:

enable
int (*enable) (struct cpuidle_driver *drv, struct cpuidle_device *dev);

The role of this callback is to prepare the governor for handling the (logical) CPU represented by the struct cpuidle_device object pointed to by the dev argument. The struct cpuidle_driver object pointed to by the drv argument represents the CPUIdle driver to be used with that CPU (among other things, it should contain the list of struct cpuidle_state objects representing idle states that the processor holding the given CPU can be asked to enter).

It may fail, in which case it is expected to return a negative error code, and that causes the kernel to run the architecture-specific default code for idle CPUs on the CPU in question instead of CPUIdle until the ->enable() governor callback is invoked for that CPU again.

disable
void (*disable) (struct cpuidle_driver *drv, struct cpuidle_device *dev);

Called to make the governor stop handling the (logical) CPU represented by the struct cpuidle_device object pointed to by the dev argument.

It is expected to reverse any changes made by the ->enable() callback when it was last invoked for the target CPU, free all memory allocated by that callback and so on.
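The pairing between the two callbacks can be illustrated with a small user-space model. Everything here is an assumption made for illustration: the gov_model_* names, the per-CPU data layout and the use of a plain CPU number in place of the drv and dev arguments. Only the contract follows the text above: enable returns a negative error code on failure, and disable frees what enable allocated.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define GOV_MODEL_NR_CPUS 4

/* Hypothetical per-CPU governor data allocated by the enable step. */
struct gov_model_cpu_data {
    int last_state_index;
};

static struct gov_model_cpu_data *gov_model_data[GOV_MODEL_NR_CPUS];

/* Model of ->enable(): prepare per-CPU data or return a negative error code. */
static int gov_model_enable(unsigned int cpu)
{
    struct gov_model_cpu_data *data;

    assert(cpu < GOV_MODEL_NR_CPUS);

    data = calloc(1, sizeof(*data));
    if (data == NULL)
        return -ENOMEM;

    free(gov_model_data[cpu]);      /* drop data left by an earlier enable */
    gov_model_data[cpu] = data;
    return 0;
}

/* Model of ->disable(): reverse what the last enable did for this CPU. */
static void gov_model_disable(unsigned int cpu)
{
    assert(cpu < GOV_MODEL_NR_CPUS);

    free(gov_model_data[cpu]);
    gov_model_data[cpu] = NULL;
}
```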

select
int (*select) (struct cpuidle_driver *drv, struct cpuidle_device *dev, bool *stop_tick);

Called to select an idle state for the processor holding the (logical) CPU represented by the struct cpuidle_device object pointed to by the dev argument.

The list of idle states to take into consideration is represented by the states array of struct cpuidle_state objects held by the struct cpuidle_driver object pointed to by the drv argument (which represents the CPUIdle driver to be used with the CPU at hand). The value returned by this callback is interpreted as an index into that array (unless it is a negative error code).

The stop_tick argument is used to indicate whether or not to stop the scheduler tick before asking the processor to enter the selected idle state. When the bool variable pointed to by it (which is set to true before invoking this callback) is cleared to false, the processor will be asked to enter the selected idle state without stopping the scheduler tick on the given CPU (if the tick has been stopped on that CPU already, however, it will not be restarted before asking the processor to enter the idle state).

This callback is mandatory (i.e. the select callback pointer in struct cpuidle_governor must not be NULL for the registration of the governor to succeed).
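The stop_tick protocol described above can be sketched from the caller's point of view. This is a user-space model with hypothetical names (model_select_fn, model_idle_entry(), model_select_short_idle()); the predicted_idle_us argument and the 1000 microsecond threshold are assumptions made for the example and are not part of the real callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback type keeping only the stop_tick part of ->select(). */
typedef int (*model_select_fn)(unsigned int predicted_idle_us, bool *stop_tick);

/* Example policy: keep the tick running when a short idle period is expected. */
static int model_select_short_idle(unsigned int predicted_idle_us,
                                   bool *stop_tick)
{
    if (predicted_idle_us < 1000) {     /* arbitrary threshold */
        *stop_tick = false;
        return 0;                       /* shallow state */
    }
    return 1;                           /* deeper state, tick may be stopped */
}

/*
 * Caller side of the protocol. Returns true if the tick is stopped when the
 * selected idle state is entered, and stores the selected index.
 */
static bool model_idle_entry(model_select_fn select,
                             unsigned int predicted_idle_us,
                             bool tick_already_stopped, int *state_index)
{
    bool stop_tick = true;              /* set to true before the callback */

    assert(select != NULL && state_index != NULL);

    *state_index = select(predicted_idle_us, &stop_tick);

    /* A tick that has been stopped already is not restarted. */
    return tick_already_stopped || stop_tick;
}
```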

reflect
void (*reflect) (struct cpuidle_device *dev, int index);

Called to allow the governor to evaluate the accuracy of the idle state selection made by the ->select() callback (when it was invoked last time) and possibly use the result of that to improve the accuracy of idle state selections in the future.

In addition, CPUIdle governors are required to take power management quality of service (PM QoS) constraints on the processor wakeup latency into account when selecting idle states. In order to obtain the current effective PM QoS wakeup latency constraint for a given CPU, a CPUIdle governor is expected to pass the number of the CPU to cpuidle_governor_latency_req(). Then, the governor’s ->select() callback must not return the index of an idle state whose exit_latency value is greater than the number returned by that function.
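A minimal user-space sketch of a selection routine that honors such a latency limit is shown below. It is not taken from any existing governor: struct sel_state and sel_pick_state() are hypothetical, the latency limit is passed in as a plain number instead of being obtained from cpuidle_governor_latency_req(), and the predicted idle duration is assumed to be known.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduced idle state description (times in microseconds). */
struct sel_state {
    unsigned int exit_latency;
    unsigned int target_residency;
};

/*
 * Pick the deepest state that fits the predicted idle duration among the
 * states allowed by the latency limit. If no allowed state fits, the
 * shallowest allowed one is used. Returns -1 (standing in for a negative
 * error code) if every state exceeds the latency limit.
 */
static int sel_pick_state(const struct sel_state *states, int state_count,
                          unsigned int predicted_idle_us,
                          unsigned int latency_req_us)
{
    int chosen = -1;
    int i;

    assert(states != NULL && state_count > 0);

    for (i = 0; i < state_count; i++) {
        if (states[i].exit_latency > latency_req_us)
            continue;   /* would violate the wakeup latency constraint */
        if (chosen < 0 || states[i].target_residency <= predicted_idle_us)
            chosen = i;
    }
    return chosen;
}
```

The key property is that a state whose exit_latency exceeds the limit is never returned, whatever the predicted idle duration is.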

CPU Idle Time Management Drivers

CPU idle time management (CPUIdle) drivers provide an interface between the other parts of CPUIdle and the hardware.

First of all, a CPUIdle driver has to populate the states array of struct cpuidle_state objects included in the struct cpuidle_driver object representing it. Going forward, this array will represent the list of available idle states that the processor hardware can be asked to enter, shared by all of the logical CPUs handled by the given driver.

The entries in the states array are expected to be sorted by the value of the target_residency field in struct cpuidle_state in the ascending order (that is, index 0 should correspond to the idle state with the minimum value of target_residency). [Since the target_residency value is expected to reflect the “depth” of the idle state represented by the struct cpuidle_state object holding it, this sorting order should be the same as the ascending sorting order by the idle state “depth”.]
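The ordering requirement is easy to check mechanically. The following user-space helper is a sketch only: struct sorted_state and sorted_states_ok() are hypothetical names, and the structure keeps just the one field that matters for the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduced idle state description. */
struct sorted_state {
    unsigned int target_residency;  /* microseconds */
};

/* Return true if the entries are in ascending order of target_residency. */
static bool sorted_states_ok(const struct sorted_state *states, int state_count)
{
    int i;

    assert(states != NULL && state_count >= 0);

    for (i = 1; i < state_count; i++) {
        if (states[i].target_residency < states[i - 1].target_residency)
            return false;
    }
    return true;
}
```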

Three fields in struct cpuidle_state are used by the existing CPUIdle governors for computations related to idle state selection:

target_residency

Minimum time to spend in this idle state including the time needed to enter it (which may be substantial) to save more energy than could be saved by staying in a shallower idle state for the same amount of time, in microseconds.

exit_latency

Maximum time it will take a CPU asking the processor to enter this idle state to start executing the first instruction after a wakeup from it, in microseconds.

flags

Flags representing idle state properties. Currently, governors only use the CPUIDLE_FLAG_POLLING flag which is set if the given object does not represent a real idle state, but an interface to a software “loop” that can be used in order to avoid asking the processor to enter any idle state at all. [There are other flags used by the CPUIdle core in special situations.]

The enter callback pointer in struct cpuidle_state, which must not be NULL, points to the routine to execute in order to ask the processor to enter this particular idle state:

int (*enter) (struct cpuidle_device *dev, struct cpuidle_driver *drv, int index);

The first two arguments of it point to the struct cpuidle_device object representing the logical CPU running this callback and the struct cpuidle_driver object representing the driver itself, respectively, and the last one is an index of the struct cpuidle_state entry in the driver’s states array representing the idle state to ask the processor to enter.

The analogous ->enter_s2idle() callback in struct cpuidle_state is used only for implementing the suspend-to-idle system-wide power management feature. The difference between it and ->enter() is that it must not re-enable interrupts at any point (even temporarily) or attempt to change the states of clock event devices, which the ->enter() callback may do sometimes.

Once the states array has been populated, the number of valid entries in it has to be stored in the state_count field of the struct cpuidle_driver object representing the driver. Moreover, if any entries in the states array represent “coupled” idle states (that is, idle states that can only be asked for if multiple related logical CPUs are idle), the safe_state_index field in struct cpuidle_driver needs to be the index of an idle state that is not “coupled” (that is, one that can be asked for if only one logical CPU is idle).
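The bookkeeping described above can be modeled in user space as follows. The structures are hypothetical reductions (the coupled marker in particular is a plain boolean invented for this sketch), and picking the first state that is not “coupled” as the safe one is an assumption of the example rather than a documented requirement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DRV_MODEL_MAX_STATES 8

/* Hypothetical reduced idle state description. */
struct drv_model_state {
    const char *name;
    bool coupled;       /* invented marker for a "coupled" idle state */
};

/* Hypothetical reduced driver object. */
struct drv_model {
    struct drv_model_state states[DRV_MODEL_MAX_STATES];
    int state_count;
    int safe_state_index;
};

/*
 * Record the number of populated entries and, if any of them is "coupled",
 * point safe_state_index at the first entry that is not. Returns 0 on
 * success and -1 if coupled states exist but no safe state can be found.
 */
static int drv_model_finalize(struct drv_model *drv, int populated)
{
    bool any_coupled = false;
    int first_uncoupled = -1;
    int i;

    assert(drv != NULL && populated > 0 && populated <= DRV_MODEL_MAX_STATES);

    drv->state_count = populated;

    for (i = 0; i < populated; i++) {
        if (drv->states[i].coupled)
            any_coupled = true;
        else if (first_uncoupled < 0)
            first_uncoupled = i;
    }

    if (!any_coupled)
        return 0;       /* safe_state_index is not needed in this case */
    if (first_uncoupled < 0)
        return -1;

    drv->safe_state_index = first_uncoupled;
    return 0;
}
```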

In addition to that, if the given CPUIdle driver is only going to handle a subset of logical CPUs in the system, the cpumask field in its struct cpuidle_driver object must point to the set (mask) of CPUs that will be handled by it.

A CPUIdle driver can only be used after it has been registered. If there are no “coupled” idle state entries in the driver’s states array, that can be accomplished by passing the driver’s struct cpuidle_driver object to cpuidle_register_driver(). Otherwise, cpuidle_register() should be used for this purpose.

However, it also is necessary to register struct cpuidle_device objects for all of the logical CPUs to be handled by the given CPUIdle driver with the help of cpuidle_register_device() after the driver has been registered, and cpuidle_register_driver(), unlike cpuidle_register(), does not do that automatically. For this reason, the drivers that use cpuidle_register_driver() to register themselves must also take care of registering the struct cpuidle_device objects as needed, so it is generally recommended to use cpuidle_register() for CPUIdle driver registration in all cases.

The registration of a struct cpuidle_device object causes the CPUIdle sysfs interface to be created and the governor’s ->enable() callback to be invoked for the logical CPU represented by it, so it must take place after registering the driver that will handle the CPU in question.

CPUIdle drivers and struct cpuidle_device objects can be unregistered when they are not necessary any more, which allows some resources associated with them to be released. Due to dependencies between them, all of the struct cpuidle_device objects representing CPUs handled by the given CPUIdle driver must be unregistered, with the help of cpuidle_unregister_device(), before calling cpuidle_unregister_driver() to unregister the driver. Alternatively, cpuidle_unregister() can be called to unregister a CPUIdle driver along with all of the struct cpuidle_device objects representing CPUs handled by it.
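The ordering constraints on registration and unregistration can be summarized with a small user-space state model. The reg_model_* functions are hypothetical and do not mirror the signatures or return values of the kernel functions named above. They only encode the dependencies: devices are registered after the driver, and the driver is unregistered after its devices.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical registration state of one driver and its devices. */
struct reg_model {
    bool driver_registered;
    int devices_registered;
};

/* Each function returns 0 on success and -1 on an ordering violation. */
static int reg_model_register_driver(struct reg_model *m)
{
    assert(m != NULL);
    if (m->driver_registered)
        return -1;
    m->driver_registered = true;
    return 0;
}

static int reg_model_register_device(struct reg_model *m)
{
    assert(m != NULL);
    if (!m->driver_registered)
        return -1;      /* the driver has to be registered first */
    m->devices_registered++;
    return 0;
}

static int reg_model_unregister_device(struct reg_model *m)
{
    assert(m != NULL);
    if (m->devices_registered == 0)
        return -1;
    m->devices_registered--;
    return 0;
}

static int reg_model_unregister_driver(struct reg_model *m)
{
    assert(m != NULL);
    if (!m->driver_registered || m->devices_registered > 0)
        return -1;      /* all devices have to be unregistered first */
    m->driver_registered = false;
    return 0;
}
```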

CPUIdle drivers can respond to runtime system configuration changes that lead to modifications of the list of available processor idle states (which can happen, for example, when the system’s power source is switched from AC to battery or the other way around). Upon a notification of such a change, a CPUIdle driver is expected to call cpuidle_pause_and_lock() to turn CPUIdle off temporarily and then cpuidle_disable_device() for all of the struct cpuidle_device objects representing CPUs affected by that change. Next, it can update its states array in accordance with the new configuration of the system, call cpuidle_enable_device() for all of the relevant struct cpuidle_device objects and invoke cpuidle_resume_and_unlock() to allow CPUIdle to be used again.
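The order of operations in that sequence can be sketched with a user-space model that merely records the steps. The upd_* names are hypothetical; each recorded step stands for the kernel call named in its comment, and the number of affected CPUs is passed in as a plain integer.

```c
#include <assert.h>
#include <stddef.h>

#define UPD_MAX_STEPS 32

/* Steps of the update sequence, named after the calls they stand for. */
enum upd_step {
    UPD_PAUSE_AND_LOCK,     /* cpuidle_pause_and_lock() */
    UPD_DISABLE_DEVICE,     /* cpuidle_disable_device(), once per CPU */
    UPD_UPDATE_STATES,      /* driver rewrites its states array */
    UPD_ENABLE_DEVICE,      /* cpuidle_enable_device(), once per CPU */
    UPD_RESUME_AND_UNLOCK   /* cpuidle_resume_and_unlock() */
};

/* Hypothetical log of the steps taken, in order. */
struct upd_log {
    enum upd_step steps[UPD_MAX_STEPS];
    int count;
};

static void upd_record(struct upd_log *log, enum upd_step step)
{
    assert(log != NULL && log->count < UPD_MAX_STEPS);
    log->steps[log->count++] = step;
}

/* Model of a driver reacting to a change affecting nr_cpus logical CPUs. */
static void upd_reconfigure(struct upd_log *log, int nr_cpus)
{
    int cpu;

    upd_record(log, UPD_PAUSE_AND_LOCK);
    for (cpu = 0; cpu < nr_cpus; cpu++)
        upd_record(log, UPD_DISABLE_DEVICE);
    upd_record(log, UPD_UPDATE_STATES);
    for (cpu = 0; cpu < nr_cpus; cpu++)
        upd_record(log, UPD_ENABLE_DEVICE);
    upd_record(log, UPD_RESUME_AND_UNLOCK);
}
```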