Device Power Management Basics

Copyright: © 2010-2011 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
Copyright: © 2010 Alan Stern <stern@rowland.harvard.edu>
Copyright: © 2016 Intel Corporation
Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Most of the code in Linux is device drivers, so most of the Linux power management (PM) code is also driver-specific. Most drivers will do very little; others, especially for platforms with small batteries (like cell phones), will do a lot.

This writeup gives an overview of how drivers interact with system-wide power management goals, emphasizing the models and interfaces that are shared by everything that hooks up to the driver model core. Read it as background for the domain-specific work you’d do with any specific driver.

Two Models for Device Power Management

Drivers will use one or both of these models to put devices into low-power states:

System Sleep model:

Drivers can enter low-power states as part of entering system-wide low-power states like “suspend” (also known as “suspend-to-RAM”), or (mostly for systems with disks) “hibernation” (also known as “suspend-to-disk”).

This is something that device, bus, and class drivers collaborate on by implementing various role-specific suspend and resume methods to cleanly power down hardware and software subsystems, then reactivate them without loss of data.

Some drivers can manage hardware wakeup events, which make the system leave the low-power state. This feature may be enabled or disabled using the relevant /sys/devices/.../power/wakeup file (for Ethernet drivers the ioctl interface used by ethtool may also be used for this purpose); enabling it may cost some power usage, but let the whole system enter low-power states more often.

Runtime Power Management model:

Devices may also be put into low-power states while the system is running, independently of other power management activity in principle. However, devices are not generally independent of each other (for example, a parent device cannot be suspended unless all of its child devices have been suspended). Moreover, depending on the bus type the device is on, it may be necessary to carry out some bus-specific operations on the device for this purpose. Devices put into low power states at run time may require special handling during system-wide power transitions (suspend or hibernation).

For these reasons not only the device driver itself, but also the appropriate subsystem (bus type, device type or device class) driver and the PM core are involved in runtime power management. As in the system sleep power management case, they need to collaborate by implementing various role-specific suspend and resume methods, so that the hardware is cleanly powered down and reactivated without data or service loss.

There’s not a lot to be said about those low-power states except that they are very system-specific, and often device-specific. Also, that if enough devices have been put into low-power states (at runtime), the effect may be very similar to entering some system-wide low-power state (system sleep) … and that synergies exist, so that several drivers using runtime PM might put the system into a state where even deeper power saving options are available.

Most suspended devices will have quiesced all I/O: no more DMA or IRQs (except for wakeup events), no more data read or written, and requests from upstream drivers are no longer accepted. A given bus or platform may have different requirements though.

Examples of hardware wakeup events include an alarm from a real time clock, network wake-on-LAN packets, keyboard or mouse activity, and media insertion or removal (for PCMCIA, MMC/SD, USB, and so on).

Interfaces for Entering System Sleep States

There are programming interfaces provided for subsystems (bus type, device type, device class) and device drivers to allow them to participate in the power management of devices they are concerned with. These interfaces cover both system sleep and runtime power management.

Device Power Management Operations

Device power management operations, at the subsystem level as well as at the device driver level, are implemented by defining and populating objects of type struct dev_pm_ops defined in include/linux/pm.h. The roles of the methods included in it will be explained in what follows. For now, it should be sufficient to remember that the last three methods are specific to runtime power management while the remaining ones are used during system-wide power transitions.
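
As a rough illustration of what “defining and populating” such an object looks like, the sketch below uses a simplified userspace stand-in for struct dev_pm_ops and a hypothetical “foo” driver. The stand-in structures, the powered field and the foo_* names are assumptions made for this example; the real structure in include/linux/pm.h has more members (the *_late, *_early and *_noirq variants among them), and real callbacks operate on the kernel’s struct device.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for illustration only, NOT the kernel definitions. */
struct device {
    int powered;    /* hypothetical: 1 = full power, 0 = low power */
};

struct dev_pm_ops {
    /* Used during system-wide power transitions. */
    int (*suspend)(struct device *dev);
    int (*resume)(struct device *dev);
    int (*freeze)(struct device *dev);
    int (*thaw)(struct device *dev);
    int (*poweroff)(struct device *dev);
    int (*restore)(struct device *dev);
    /* The last three are specific to runtime power management. */
    int (*runtime_suspend)(struct device *dev);
    int (*runtime_resume)(struct device *dev);
    int (*runtime_idle)(struct device *dev);
};

/* Hypothetical driver callbacks; 0 means success. */
static int foo_suspend(struct device *dev)
{
    assert(dev != NULL);
    dev->powered = 0;
    return 0;
}

static int foo_resume(struct device *dev)
{
    assert(dev != NULL);
    dev->powered = 1;
    return 0;
}

/* One pair of routines shared by all the callbacks this driver needs. */
static const struct dev_pm_ops foo_pm_ops = {
    .suspend         = foo_suspend,
    .resume          = foo_resume,
    .freeze          = foo_suspend,
    .thaw            = foo_resume,
    .poweroff        = foo_suspend,
    .restore         = foo_resume,
    .runtime_suspend = foo_suspend,
    .runtime_resume  = foo_resume,
    /* .runtime_idle is left NULL: not every driver uses every callback. */
};
```

In a real driver, an object like foo_pm_ops would be referenced from the pm member of the driver structure rather than called directly.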

There also is a deprecated “old” or “legacy” interface for power management operations available at least for some subsystems. This approach does not use struct dev_pm_ops objects and it is suitable only for implementing system sleep power management methods in a limited way. Therefore it is not described in this document, so please refer directly to the source code for more information about it.

Subsystem-Level Methods

The core methods to suspend and resume devices reside in struct dev_pm_ops pointed to by the ops member of struct dev_pm_domain, or by the pm member of struct bus_type, struct device_type and struct class. They are mostly of interest to the people writing infrastructure for platforms and buses, like PCI or USB, or device type and device class drivers. They also are relevant to the writers of device drivers whose subsystems (PM domains, device types, device classes and bus types) don’t provide all power management methods.

Bus drivers implement these methods as appropriate for the hardware and the drivers using it; PCI works differently from USB, and so on. Not many people write subsystem-level drivers; most driver code is a “device driver” that builds on top of bus-specific framework code.

For more information on these driver calls, see the description later; they are called in phases for every device, respecting the parent-child sequencing in the driver model tree.

/sys/devices/.../power/wakeup files

All device objects in the driver model contain fields that control the handling of system wakeup events (hardware signals that can force the system out of a sleep state). These fields are initialized by bus or device driver code using device_set_wakeup_capable() and device_set_wakeup_enable(), defined in include/linux/pm_wakeup.h.

The power.can_wakeup flag just records whether the device (and its driver) can physically support wakeup events. The device_set_wakeup_capable() routine affects this flag. The power.wakeup field is a pointer to an object of type struct wakeup_source used for controlling whether or not the device should use its system wakeup mechanism and for notifying the PM core of system wakeup events signaled by the device. This object is only present for wakeup-capable devices (i.e. devices whose can_wakeup flags are set) and is created (or removed) by device_set_wakeup_capable().

Whether or not a device is capable of issuing wakeup events is a hardware matter, and the kernel is responsible for keeping track of it. By contrast, whether or not a wakeup-capable device should issue wakeup events is a policy decision, and it is managed by user space through a sysfs attribute: the power/wakeup file. User space can write the “enabled” or “disabled” strings to it to indicate whether or not, respectively, the device is supposed to signal system wakeup. This file is only present if the power.wakeup object exists for the given device and is created (or removed) along with that object, by device_set_wakeup_capable(). Reads from the file will return the corresponding string.
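
From user space, setting this policy amounts to writing one of the two strings to the attribute file and reading it back. The following is a minimal userspace sketch of that interaction; the helper names are hypothetical, the caller supplies the full path of a device’s power/wakeup file, and writing to a real sysfs attribute typically requires root privileges.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write "enabled" or "disabled" to a power/wakeup attribute file.
 * Returns 0 on success, -1 on failure (for example, if the file does
 * not exist because the device is not wakeup-capable).
 */
static int set_wakeup_policy(const char *wakeup_path, int enable)
{
    assert(wakeup_path != NULL);

    FILE *f = fopen(wakeup_path, "w");
    if (!f)
        return -1;

    int ok = fputs(enable ? "enabled" : "disabled", f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Read the current policy: 1 = enabled, 0 = disabled, -1 = error. */
static int get_wakeup_policy(const char *wakeup_path)
{
    assert(wakeup_path != NULL);

    char buf[16] = "";
    FILE *f = fopen(wakeup_path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, sizeof(buf), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);

    buf[strcspn(buf, "\n")] = '\0';    /* strip a trailing newline, if any */
    if (strcmp(buf, "enabled") == 0)
        return 1;
    if (strcmp(buf, "disabled") == 0)
        return 0;
    return -1;
}
```

A missing file is reported as an error rather than as “disabled”, because the absence of the attribute means the device cannot signal wakeup at all, which is a different condition from wakeup being turned off by policy.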

The initial value in the power/wakeup file is “disabled” for the majority of devices; the major exceptions are power buttons, keyboards, and Ethernet adapters whose WoL (wake-on-LAN) feature has been set up with ethtool. It should also default to “enabled” for devices that don’t generate wakeup requests on their own but merely forward wakeup requests from one bus to another (like PCI Express ports).

The device_may_wakeup() routine returns true only if the power.wakeup object exists and the corresponding power/wakeup file contains the “enabled” string. This information is used by subsystems, like the PCI bus type code, to see whether or not to enable the devices’ wakeup mechanisms. If device wakeup mechanisms are enabled or disabled directly by drivers, they also should use device_may_wakeup() to decide what to do during a system sleep transition. Device drivers, however, are not expected to call device_set_wakeup_enable() directly in any case.

It ought to be noted that system wakeup is conceptually different from “remote wakeup” used by runtime power management, although it may be supported by the same physical mechanism. Remote wakeup is a feature allowing devices in low-power states to trigger specific interrupts to signal conditions in which they should be put into the full-power state. Those interrupts may or may not be used to signal system wakeup events, depending on the hardware design. On some systems it is impossible to trigger them from system sleep states. In any case, remote wakeup should always be enabled for runtime power management for all devices and drivers that support it.

/sys/devices/.../power/control files

Each device in the driver model has a flag to control whether it is subject to runtime power management. This flag, runtime_auto, is initialized by the bus type (or generally subsystem) code using pm_runtime_allow() or pm_runtime_forbid(); the default is to allow runtime power management.

The setting can be adjusted by user space by writing either “on” or “auto” to the device’s power/control sysfs file. Writing “auto” calls pm_runtime_allow(), setting the flag and allowing the device to be runtime power-managed by its driver. Writing “on” calls pm_runtime_forbid(), clearing the flag, returning the device to full power if it was in a low-power state, and preventing the device from being runtime power-managed. User space can check the current value of the runtime_auto flag by reading that file.

The device’s runtime_auto flag has no effect on the handling of system-wide power transitions. In particular, the device can (and in the majority of cases should and will) be put into a low-power state during a system-wide transition to a sleep state even though its runtime_auto flag is clear.

For more information about the runtime power management framework, refer to Documentation/power/runtime_pm.rst.

Calling Drivers to Enter and Leave System Sleep States

When the system goes into a sleep state, each device’s driver is asked to suspend the device by putting it into a state compatible with the target system state. That’s usually some version of “off”, but the details are system-specific. Also, wakeup-enabled devices will usually stay partly functional in order to wake the system.

When the system leaves that low-power state, the device’s driver is asked to resume it by returning it to full power. The suspend and resume operations always go together, and both are multi-phase operations.

For simple drivers, suspend might quiesce the device using class code and then turn its hardware as “off” as possible during suspend_noirq. The matching resume calls would then completely reinitialize the hardware before reactivating its class I/O queues.

More power-aware drivers might prepare the devices for triggering system wakeup events.

Call Sequence Guarantees

To ensure that bridges and similar links needing to talk to a device are available when the device is suspended or resumed, the device hierarchy is walked in a bottom-up order to suspend devices. A top-down order is used to resume those devices.

The ordering of the device hierarchy is defined by the order in which devices get registered: a child can never be registered, probed or resumed before its parent; and can’t be removed or suspended after that parent.
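
One way to picture these guarantees is as a single list kept in registration order: walking it backwards visits children before parents (suspend), and walking it forwards visits parents before children (resume). The sketch below is a simplified userspace model of that idea only; the dev_list structure and the function names are hypothetical and are not PM core interfaces.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEVS 8

/* Model: devices in registration order, so a parent precedes its children. */
struct dev_list {
    const char *names[MAX_DEVS];
    size_t count;
};

/* Suspend order: reverse registration order (bottom-up, children first). */
static void suspend_order(const struct dev_list *l, const char **out)
{
    assert(l != NULL && out != NULL && l->count <= MAX_DEVS);
    for (size_t i = 0; i < l->count; i++)
        out[i] = l->names[l->count - 1 - i];
}

/* Resume order: registration order (top-down, parents first). */
static void resume_order(const struct dev_list *l, const char **out)
{
    assert(l != NULL && out != NULL && l->count <= MAX_DEVS);
    for (size_t i = 0; i < l->count; i++)
        out[i] = l->names[i];
}
```

For a list registered as bridge, host controller, disk, the model suspends the disk first and the bridge last, then resumes the bridge first, so every link a device depends on is still (or already) available when that device’s callbacks run.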

The policy is that the device hierarchy should match hardware bus topology. [Or at least the control bus, for devices which use multiple busses.] In particular, this means that a device registration may fail if the parent of the device is suspending (i.e. has been chosen by the PM core as the next device to suspend) or has already suspended, as well as after all of the other devices have been suspended. Device drivers must be prepared to cope with such situations.

System Power Management Phases

Suspending or resuming the system is done in several phases. Different phases are used for suspend-to-idle, shallow (standby), and deep (“suspend-to-RAM”) sleep states and the hibernation state (“suspend-to-disk”). Each phase involves executing callbacks for every device before the next phase begins. Not all buses or classes support all these callbacks and not all drivers use all the callbacks. The various phases always run after tasks have been frozen and before they are unfrozen. Furthermore, the *_noirq phases run at a time when IRQ handlers have been disabled (except for those marked with the IRQF_NO_SUSPEND flag).

All phases use PM domain, bus, type, class or driver callbacks (that is, methods defined in dev->pm_domain->ops, dev->bus->pm, dev->type->pm, dev->class->pm or dev->driver->pm). These callbacks are regarded by the PM core as mutually exclusive. Moreover, PM domain callbacks always take precedence over all of the other callbacks and, for example, type callbacks take precedence over bus, class and driver callbacks. To be precise, the following rules are used to determine which callback to execute in the given phase:

  1. If dev->pm_domain is present, the PM core will choose the callback provided by dev->pm_domain->ops for execution.
  2. Otherwise, if both dev->type and dev->type->pm are present, the callback provided by dev->type->pm will be chosen for execution.
  3. Otherwise, if both dev->class and dev->class->pm are present, the callback provided by dev->class->pm will be chosen for execution.
  4. Otherwise, if both dev->bus and dev->bus->pm are present, the callback provided by dev->bus->pm will be chosen for execution.

This allows PM domains and device types to override callbacks provided by bus types or device classes if necessary.

The PM domain, type, class and bus callbacks may in turn invoke device- or driver-specific methods stored in dev->driver->pm, but they don’t have to do that.

If the subsystem callback chosen for execution is not present, the PM core will execute the corresponding method from the dev->driver->pm set instead if there is one.
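
The selection rules above, including the fallback to the driver’s method, can be sketched as follows. This is a simplified userspace model: the structures are stand-ins reduced to a single suspend callback, and choose_suspend() and the three example callbacks are hypothetical names, not PM core functions.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*pm_callback_t)(void);

/* Stand-ins for the kernel structures; one callback per level only. */
struct pm_ops    { pm_callback_t suspend; };
struct pm_domain { struct pm_ops ops; };
struct dev_type  { const struct pm_ops *pm; };
struct dev_class { const struct pm_ops *pm; };
struct bus_type  { const struct pm_ops *pm; };
struct driver    { const struct pm_ops *pm; };

struct device {
    const struct pm_domain *pm_domain;
    const struct dev_type  *type;
    const struct dev_class *class;
    const struct bus_type  *bus;
    const struct driver    *driver;
};

/* Pick the callback to run for one phase, following rules 1-4. */
static pm_callback_t choose_suspend(const struct device *dev)
{
    pm_callback_t cb = NULL;

    assert(dev != NULL);

    if (dev->pm_domain)
        cb = dev->pm_domain->ops.suspend;
    else if (dev->type && dev->type->pm)
        cb = dev->type->pm->suspend;
    else if (dev->class && dev->class->pm)
        cb = dev->class->pm->suspend;
    else if (dev->bus && dev->bus->pm)
        cb = dev->bus->pm->suspend;

    /* Chosen subsystem callback absent: use the driver's, if there is one. */
    if (!cb && dev->driver && dev->driver->pm)
        cb = dev->driver->pm->suspend;

    return cb;
}

/* Hypothetical callbacks, distinguishable by their return values. */
static int domain_suspend(void) { return 1; }
static int bus_suspend(void)    { return 2; }
static int driver_suspend(void) { return 3; }
```

Note that the levels are mutually exclusive in this model: once a PM domain is present, the bus callback is never consulted, even if the domain’s own callback for the phase is missing; only the driver’s method serves as a fallback.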

Entering System Suspend

When the system goes into the freeze, standby or memory sleep state, the phases are: prepare, suspend, suspend_late, suspend_noirq.

  1. The prepare phase is meant to prevent races by preventing new devices from being registered; the PM core would never know that all the children of a device had been suspended if new children could be registered at will. [By contrast, from the PM core’s perspective, devices may be unregistered at any time.] Unlike the other suspend-related phases, during the prepare phase the device hierarchy is traversed top-down.

    After the ->prepare callback method returns, no new children may be registered below the device. The method may also prepare the device or driver in some way for the upcoming system power transition, but it should not put the device into a low-power state. Moreover, if the device supports runtime power management, the ->prepare callback method must not update its state in case it is necessary to resume it from runtime suspend later on.

    For devices supporting runtime power management, the return value of the prepare callback can be used to indicate to the PM core that it may safely leave the device in runtime suspend (if runtime-suspended already), provided that all of the device’s descendants are also left in runtime suspend. Namely, if the prepare callback returns a positive number and that happens for all of the descendants of the device too, and all of them (including the device itself) are runtime-suspended, the PM core will skip the suspend, suspend_late and suspend_noirq phases as well as all of the corresponding phases of the subsequent device resume for all of these devices. In that case, the ->complete callback will be the next one invoked after the ->prepare callback and is entirely responsible for putting the device into a consistent state as appropriate.

    Note that this direct-complete procedure applies even if the device is disabled for runtime PM; only the runtime-PM status matters. It follows that if a device has system-sleep callbacks but does not support runtime PM, then its prepare callback must never return a positive value. This is because all such devices are initially set to runtime-suspended with runtime PM disabled.

    This feature also can be controlled by device drivers by using the DPM_FLAG_NO_DIRECT_COMPLETE and DPM_FLAG_SMART_PREPARE driver power management flags. [Typically, they are set at the time the driver is probed against the device in question by passing them to the dev_pm_set_driver_flags() helper function.] If the first of these flags is set, the PM core will not apply the direct-complete procedure described above to the given device and, consequently, to any of its ancestors. The second flag, when set, informs the middle layer code (bus types, device types, PM domains, classes) that it should take the return value of the ->prepare callback provided by the driver into account and it may only return a positive value from its own ->prepare callback if the driver’s one also has returned a positive value.

  2. The ->suspend methods should quiesce the device to stop it from performing I/O. They also may save the device registers and put it into the appropriate low-power state, depending on the bus type the device is on, and they may enable wakeup events.

    However, for devices supporting runtime power management, the ->suspend methods provided by subsystems (bus types and PM domains in particular) must follow an additional rule regarding what can be done to the devices before their drivers’ ->suspend methods are called. Namely, they may resume the devices from runtime suspend by calling pm_runtime_resume() for them, if that is necessary, but they must not update the state of the devices in any other way at that time (in case the drivers need to resume the devices from runtime suspend in their ->suspend methods). In fact, the PM core prevents subsystems or drivers from putting devices into runtime suspend at these times by calling pm_runtime_get_noresume() before issuing the ->prepare callback (and calling pm_runtime_put() after issuing the ->complete callback).

  3. For a number of devices it is convenient to split suspend into the “quiesce device” and “save device state” phases, in which cases suspend_late is meant to do the latter. It is always executed after runtime power management has been disabled for the device in question.

  4. The suspend_noirq phase occurs after IRQ handlers have been disabled, which means that the driver’s interrupt handler will not be called while the callback method is running. The ->suspend_noirq methods should save the values of the device’s registers that weren’t saved previously and finally put the device into the appropriate low-power state.

    The majority of subsystems and device drivers need not implement this callback. However, bus types allowing devices to share interrupt vectors, like PCI, generally need it; otherwise a driver might encounter an error during the suspend phase by fielding a shared interrupt generated by some other device after its own device had been set to low power.
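
The direct-complete condition described for the prepare phase can be expressed as a check over a device and all of its descendants. The sketch below is a simplified userspace model of that condition only: the dc_dev structure and may_direct_complete() are hypothetical, the no_direct_complete field merely models the effect of DPM_FLAG_NO_DIRECT_COMPLETE, and the PM core’s own bookkeeping is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHILDREN 4

struct dc_dev {
    int prepare_ret;            /* value returned by the ->prepare callback */
    bool runtime_suspended;     /* runtime-PM status of the device */
    bool no_direct_complete;    /* models DPM_FLAG_NO_DIRECT_COMPLETE */
    const struct dc_dev *children[MAX_CHILDREN];
    size_t nr_children;
};

/*
 * True if the device may be left in runtime suspend for the whole
 * transition: ->prepare returned a positive number, the device is
 * runtime-suspended, the opt-out flag is clear, and the same holds
 * for every descendant.
 */
static bool may_direct_complete(const struct dc_dev *dev)
{
    assert(dev != NULL && dev->nr_children <= MAX_CHILDREN);

    if (dev->no_direct_complete)
        return false;
    if (dev->prepare_ret <= 0 || !dev->runtime_suspended)
        return false;

    for (size_t i = 0; i < dev->nr_children; i++)
        if (!may_direct_complete(dev->children[i]))
            return false;

    return true;
}
```

Because the check recurses into children, a single descendant that is active, or that sets the opt-out flag, rules out direct-complete for each of its ancestors while leaving unrelated branches of the hierarchy unaffected.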

At the end of these phases, drivers should have stopped all I/O transactions (DMA, IRQs), saved enough state that they can re-initialize or restore previous state (as needed by the hardware), and placed the device into a low-power state. On many platforms they will gate off one or more clock sources; sometimes they will also switch off power supplies or reduce voltages. [Drivers supporting runtime PM may already have performed some or all of these steps.]

If device_may_wakeup(dev) returns true, the device should be prepared for generating hardware wakeup signals to trigger a system wakeup event when the system is in the sleep state. For example, enable_irq_wake() might identify GPIO signals hooked up to a switch or other external hardware, and pci_enable_wake() does something similar for the PCI PME signal.

If any of these callbacks returns an error, the system won’t enter the desired low-power state. Instead, the PM core will unwind its actions by resuming all the devices that were suspended.
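
The unwinding behaviour can be illustrated with a small userspace model that collapses the four suspend phases into a single step per device. The sleep_dev structure and suspend_all() are hypothetical names; devices are assumed to be stored in registration order, so suspend walks the array backwards and the unwind path resumes in forward order.

```c
#include <assert.h>
#include <stddef.h>

struct sleep_dev {
    int suspend_ret;    /* what this device's suspend callback returns */
    int suspended;      /* 1 while the device is suspended */
};

/*
 * Suspend devs[n-1] .. devs[0]. On the first error, resume every device
 * suspended so far (parents before children) and return the error, so
 * the system does not enter the sleep state. Returns 0 on success.
 */
static int suspend_all(struct sleep_dev *devs, size_t n)
{
    assert(devs != NULL || n == 0);

    for (size_t i = n; i-- > 0; ) {
        if (devs[i].suspend_ret != 0) {
            /* Unwind: only devices after index i were suspended. */
            for (size_t j = i + 1; j < n; j++)
                devs[j].suspended = 0;
            return devs[i].suspend_ret;
        }
        devs[i].suspended = 1;
    }
    return 0;
}
```

After a failed call every device in the model is back in its active state, which matches the all-or-nothing outcome described above.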

Leaving System Suspend

When resuming from freeze, standby or memory sleep, the phases are: resume_noirq, resume_early, resume, complete.

  1. The ->resume_noirq callback methods should perform any actions needed before the driver’s interrupt handlers are invoked. This generally means undoing the actions of the suspend_noirq phase. If the bus type permits devices to share interrupt vectors, like PCI, the method should bring the device and its driver into a state in which the driver can recognize if the device is the source of incoming interrupts, if any, and handle them correctly.

    For example, the PCI bus type’s ->pm.resume_noirq() puts the device into the full-power state (D0 in the PCI terminology) and restores the standard configuration registers of the device. Then it calls the device driver’s ->pm.resume_noirq() method to perform device-specific actions.

  2. The ->resume_early methods should prepare devices for the execution of the resume methods. This generally involves undoing the actions of the preceding suspend_late phase.

  3. The ->resume methods should bring the device back to its operating state, so that it can perform normal I/O. This generally involves undoing the actions of the suspend phase.

  4. The complete phase should undo the actions of the prepare phase. For this reason, unlike the other resume-related phases, during the complete phase the device hierarchy is traversed bottom-up.

    Note, however, that new children may be registered below the device as soon as the ->resume callbacks occur; it’s not necessary to wait until the complete phase runs.

    Moreover, if the preceding ->prepare callback returned a positive number, the device may have been left in runtime suspend throughout the whole system suspend and resume (its ->suspend, ->suspend_late, ->suspend_noirq, ->resume_noirq, ->resume_early, and ->resume callbacks may have been skipped). In that case, the ->complete callback is entirely responsible for putting the device into a consistent state after system suspend if necessary. [For example, it may need to queue up a runtime resume request for the device for this purpose.] To check if that is the case, the ->complete callback can consult the device’s power.direct_complete flag. If that flag is set when the ->complete callback is being run then the direct-complete mechanism was used, and special actions may be required to make the device work correctly afterward.

At the end of these phases, drivers should be as functional as they were before suspending: I/O can be performed using DMA and IRQs, and the relevant clocks are gated on.

However, the details here may again be platform-specific. For example, some systems support multiple “run” states, and the mode in effect at the end of resume might not be the one which preceded suspension. That means availability of certain clocks or power supplies changed, which could easily affect how a driver works.

Drivers need to be able to handle hardware which has been reset since all of the suspend methods were called, for example by complete reinitialization. This may be the hardest part, and the one most protected by NDA’d documents and chip errata. It’s simplest if the hardware state hasn’t changed since the suspend was carried out, but that can only be guaranteed if the target system sleep entered was suspend-to-idle. For the other system sleep states that may not be the case (and usually isn’t for ACPI-defined system sleep states, like S3).

Drivers must also be prepared to notice that the device has been removed while the system was powered down, whenever that’s physically possible. PCMCIA, MMC, USB, Firewire, SCSI, and even IDE are common examples of busses where common Linux platforms will see such removal. Details of how drivers will notice and handle such removals are currently bus-specific, and often involve a separate thread.

These callbacks may return an error value, but the PM core will ignore such errors since there’s nothing it can do about them other than printing them in the system log.

Entering Hibernation

Hibernating the system is more complicated than putting it into sleep states, because it involves creating and saving a system image. Therefore there are more phases for hibernation, with a different set of callbacks. These phases always run after tasks have been frozen and enough memory has been freed.

The general procedure for hibernation is to quiesce all devices (“freeze”), create an image of the system memory while everything is stable, reactivate all devices (“thaw”), write the image to permanent storage, and finally shut down the system (“power off”). The phases used to accomplish this are: prepare, freeze, freeze_late, freeze_noirq, thaw_noirq, thaw_early, thaw, complete, prepare, poweroff, poweroff_late, poweroff_noirq.

  1. The prepare phase is discussed in the “Entering System Suspend” section above.
  2. The ->freeze methods should quiesce the device so that it doesn’t generate IRQs or DMA, and they may need to save the values of device registers. However the device does not have to be put in a low-power state, and to save time it’s best not to do so. Also, the device should not be prepared to generate wakeup events.
  3. The freeze_late phase is analogous to the suspend_late phase described earlier, except that the device should not be put into a low-power state and should not be allowed to generate wakeup events.
  4. The freeze_noirq phase is analogous to the suspend_noirq phase discussed earlier, except again that the device should not be put into a low-power state and should not be allowed to generate wakeup events.

At this point the system image is created. All devices should be inactive and the contents of memory should remain undisturbed while this happens, so that the image forms an atomic snapshot of the system state.

  1. The thaw_noirq phase is analogous to the resume_noirq phase discussed earlier. The main difference is that its methods can assume the device is in the same state as at the end of the freeze_noirq phase.
  2. The thaw_early phase is analogous to the resume_early phase described above. Its methods should undo the actions of the preceding freeze_late, if necessary.
  3. The thaw phase is analogous to the resume phase discussed earlier. Its methods should bring the device back to an operating state, so that it can be used for saving the image if necessary.
  4. The complete phase is discussed in the “Leaving System Suspend” section above.

At this point the system image is saved, and the devices then need to be prepared for the upcoming system shutdown. This is much like suspending them before putting the system into the suspend-to-idle, shallow or deep sleep state, and the phases are similar.

  1. The prepare phase is discussed above.
  2. The poweroff phase is analogous to the suspend phase.
  3. The poweroff_late phase is analogous to the suspend_late phase.
  4. The poweroff_noirq phase is analogous to the suspend_noirq phase.

The ->poweroff, ->poweroff_late and ->poweroff_noirq callbacks should do essentially the same things as the ->suspend, ->suspend_late and ->suspend_noirq callbacks, respectively. A notable difference is that they need not store the device register values, because the registers should already have been stored during the freeze, freeze_late or freeze_noirq phases. Also, on many machines the firmware will power-down the entire system, so it is not necessary for the callback to put the device in a low-power state.
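
As a compact summary of this section, the table below lists the twelve hibernation phases in execution order, each paired with the suspend or resume phase it is described as analogous to. It is only a lookup aid written as a userspace sketch; the phase_map structure and sleep_analogue() are hypothetical names and do not correspond to anything in the PM core.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct phase_map {
    const char *hibernation;    /* phase run while hibernating */
    const char *sleep;          /* analogous suspend/resume phase */
};

/* Phases in the order they run when entering hibernation. */
static const struct phase_map hibernate_phases[] = {
    /* Quiesce devices; the image is created after freeze_noirq. */
    { "prepare",        "prepare" },
    { "freeze",         "suspend" },
    { "freeze_late",    "suspend_late" },
    { "freeze_noirq",   "suspend_noirq" },
    /* Reactivate devices so the image can be saved. */
    { "thaw_noirq",     "resume_noirq" },
    { "thaw_early",     "resume_early" },
    { "thaw",           "resume" },
    { "complete",       "complete" },
    /* Image saved; get ready for shutdown. */
    { "prepare",        "prepare" },
    { "poweroff",       "suspend" },
    { "poweroff_late",  "suspend_late" },
    { "poweroff_noirq", "suspend_noirq" },
};

#define NR_HIBERNATE_PHASES \
    (sizeof(hibernate_phases) / sizeof(hibernate_phases[0]))

/* Return the analogous suspend/resume phase, or NULL if unknown. */
static const char *sleep_analogue(const char *phase)
{
    assert(phase != NULL);

    for (size_t i = 0; i < NR_HIBERNATE_PHASES; i++)
        if (strcmp(hibernate_phases[i].hibernation, phase) == 0)
            return hibernate_phases[i].sleep;
    return NULL;
}
```

The analogy covers the call sequence only; as noted above, the freeze and poweroff callbacks differ from their suspend counterparts in what they should do to the device.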

Leaving Hibernation

Resuming from hibernation is, again, more complicated than resuming from a sleep state in which the contents of main memory are preserved, because it requires a system image to be loaded into memory and the pre-hibernation memory contents to be restored before control can be passed back to the image kernel.

Although in principle the image might be loaded into memory and the pre-hibernation memory contents restored by the boot loader, in practice this can’t be done because boot loaders aren’t smart enough and there is no established protocol for passing the necessary information. So instead, the boot loader loads a fresh instance of the kernel, called “the restore kernel”, into memory and passes control to it in the usual way. Then the restore kernel reads the system image, restores the pre-hibernation memory contents, and passes control to the image kernel. Thus two different kernel instances are involved in resuming from hibernation. In fact, the restore kernel may be completely different from the image kernel: a different configuration and even a different version. This has important consequences for device drivers and their subsystems.

To be able to load the system image into memory, the restore kernel needs to include at least a subset of device drivers allowing it to access the storage medium containing the image, although it doesn’t need to include all of the drivers present in the image kernel. After the image has been loaded, the devices managed by the restore kernel need to be prepared for passing control back to the image kernel. This is very similar to the initial steps involved in creating a system image, and it is accomplished in the same way, using prepare, freeze, and freeze_noirq phases. However, the devices affected by these phases are only those having drivers in the restore kernel; other devices will still be in whatever state the boot loader left them.

Should the restoration of the pre-hibernation memory contents fail, the restore kernel would go through the “thawing” procedure described above, using the thaw_noirq, thaw_early, thaw, and complete phases, and then continue running normally. This happens only rarely. Most often the pre-hibernation memory contents are restored successfully and control is passed to the image kernel, which then becomes responsible for bringing the system back to the working state.

To achieve this, the image kernel must restore the devices’ pre-hibernation functionality. The operation is much like waking up from a sleep state (with the memory contents preserved), although it involves different phases: restore_noirq, restore_early, restore, complete.

  1. The restore_noirq phase is analogous to the resume_noirq phase.
  2. The restore_early phase is analogous to the resume_early phase.
  3. The restore phase is analogous to the resume phase.
  4. The complete phase is discussed above.

The main difference from resume[_early|_noirq] is that restore[_early|_noirq] must assume the device has been accessed and reconfigured by the boot loader or the restore kernel. Consequently, the state of the device may be different from the state remembered from the freeze, freeze_late and freeze_noirq phases. The device may even need to be reset and completely re-initialized. In many cases this difference doesn’t matter, so the ->resume[_early|_noirq] and ->restore[_early|_noirq] method pointers can be set to the same routines. Nevertheless, different callback pointers are used in case there is a situation where it actually does matter.

Power Management Notifiers

There are some operations that cannot be carried out by the power management callbacks discussed above, because the callbacks occur too late or too early. To handle these cases, subsystems and device drivers may register power management notifiers that are called before tasks are frozen and after they have been thawed. Generally speaking, the PM notifiers are suitable for performing actions that either require user space to be available, or at least won’t interfere with user space.

For details refer to Suspend/Hibernation Notifiers.

Device Low-Power (suspend) States

Device low-power states aren’t standard. One device might only handle “on” and “off”, while another might support a dozen different versions of “on” (how many engines are active?), plus a state that gets back to “on” faster than from a full “off”.

Some buses define rules about what different suspend states mean. PCI gives one example: after the suspend sequence completes, a non-legacy PCI device may not perform DMA or issue IRQs, and any wakeup events it issues would be issued through the PME# bus signal. Plus, there are several PCI-standard device states, some of which are optional.

In contrast, integrated system-on-chip processors often use IRQs as the wakeup event sources (so drivers would call enable_irq_wake()) and might be able to treat DMA completion as a wakeup event (sometimes DMA can stay active too; it’d only be the CPU and some peripherals that sleep).

Some details here may be platform-specific. Systems may have devices that can be fully active in certain sleep states, such as an LCD display that’s refreshed using DMA while most of the system is sleeping lightly … and its frame buffer might even be updated by a DSP or other non-Linux CPU while the Linux control processor stays idle.

Moreover, the specific actions taken may depend on the target system state. One target system state might allow a given device to be very operational; another might require a hard shut down with re-initialization on resume. And two different target systems might use the same device in different ways; the aforementioned LCD might be active in one product’s “standby”, but a different product using the same SOC might work differently.

Device Power Management Domains

Sometimes devices share reference clocks or other power resources. In those cases it generally is not possible to put devices into low-power states individually. Instead, a set of devices sharing a power resource can be put into a low-power state together at the same time by turning off the shared power resource. Of course, they also need to be put into the full-power state together, by turning the shared power resource on. A set of devices with this property is often referred to as a power domain. A power domain may also be nested inside another power domain. The nested domain is referred to as the sub-domain of the parent domain.

Support for power domains is provided through the pm_domain field of struct device. This field is a pointer to an object of type struct dev_pm_domain, defined in include/linux/pm.h, providing a set of power management callbacks analogous to the subsystem-level and device driver callbacks that are executed for the given device during all power transitions, instead of the respective subsystem-level callbacks. Specifically, if a device’s pm_domain pointer is not NULL, the ->suspend() callback from the object pointed to by it will be executed instead of its subsystem’s (e.g. bus type’s) ->suspend() callback and analogously for all of the remaining callbacks. In other words, power management domain callbacks, if defined for the given device, always take precedence over the callbacks provided by the device’s subsystem (e.g. bus type).
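The precedence rule can be illustrated with a small user-space model. It is a sketch under simplifying assumptions rather than the PM core's code: all toy_* types and functions are hypothetical, only one callback (suspend) is modelled, and the subsystem is reduced to a bus type.

```c
#include <stddef.h>

struct toy_device;
typedef int (*toy_pm_cb)(struct toy_device *dev);

/* Hypothetical stand-ins for dev_pm_ops, dev_pm_domain and bus_type. */
struct toy_pm_ops {
	toy_pm_cb suspend;
};

struct toy_pm_domain {
	struct toy_pm_ops ops;
};

struct toy_bus {
	const struct toy_pm_ops *pm;
};

struct toy_device {
	struct toy_pm_domain *pm_domain;	/* NULL if not in a PM domain */
	const struct toy_bus *bus;
};

/* Dummy callbacks that only identify which layer was chosen. */
static int toy_domain_suspend(struct toy_device *dev) { (void)dev; return 1; }
static int toy_bus_suspend(struct toy_device *dev) { (void)dev; return 2; }

/*
 * Pick the ->suspend() callback to run for the device: a non-NULL
 * pm_domain pointer takes precedence over the bus type.
 */
static toy_pm_cb toy_pick_suspend(const struct toy_device *dev)
{
	if (dev->pm_domain)
		return dev->pm_domain->ops.suspend;
	if (dev->bus && dev->bus->pm)
		return dev->bus->pm->suspend;
	return NULL;
}
```

In this model, attaching a device to a domain changes which callback is selected without any change to the bus type, which is the point of keeping domain support out of the subsystem-level callbacks.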

The support for device power management domains is only relevant to platforms needing to use the same device driver power management callbacks in many different power domain configurations and wanting to avoid incorporating the support for power domains into subsystem-level callbacks, for example by modifying the platform bus type. Other platforms need not implement it or take it into account in any way.

Devices may be defined as IRQ-safe which indicates to the PM core that their runtime PM callbacks may be invoked with disabled interrupts (see Documentation/power/runtime_pm.rst for more information). If an IRQ-safe device belongs to a PM domain, the runtime PM of the domain will be disallowed, unless the domain itself is defined as IRQ-safe. However, it makes sense to define a PM domain as IRQ-safe only if all the devices in it are IRQ-safe. Moreover, if an IRQ-safe domain has a parent domain, the runtime PM of the parent is only allowed if the parent itself is IRQ-safe too with the additional restriction that all child domains of an IRQ-safe parent must also be IRQ-safe.
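The first two IRQ-safe rules can be written down as predicates in a user-space model. This is only an illustration of the rules as stated above; struct toy_domain and both functions are hypothetical and do not correspond to any kernel interface.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a PM domain; not a kernel structure. */
struct toy_domain {
	bool irq_safe;			/* domain itself defined as IRQ-safe */
	struct toy_domain *parent;	/* NULL for a top-level domain */
};

/*
 * Runtime PM of a domain holding an IRQ-safe device is disallowed
 * unless the domain itself is IRQ-safe.
 */
static bool toy_domain_rpm_allowed(const struct toy_domain *d,
				   bool has_irq_safe_device)
{
	return !has_irq_safe_device || d->irq_safe;
}

/*
 * Runtime PM of the parent of an IRQ-safe domain is only allowed if
 * the parent is IRQ-safe too. Domains that are not IRQ-safe, or that
 * have no parent, impose no constraint in this model.
 */
static bool toy_parent_rpm_allowed(const struct toy_domain *child)
{
	if (!child->parent || !child->irq_safe)
		return true;
	return child->parent->irq_safe;
}
```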

Runtime Power Management

Many devices are able to dynamically power down while the system is still running. This feature is useful for devices that are not being used, and can offer significant power savings on a running system. These devices often support a range of runtime power states, which might use names such as “off”, “sleep”, “idle”, “active”, and so on. Those states will in some cases (like PCI) be partially constrained by the bus the device uses, and will usually include hardware states that are also used in system sleep states.

A system-wide power transition can be started while some devices are in low power states due to runtime power management. The system sleep PM callbacks should recognize such situations and react to them appropriately, but the necessary actions are subsystem-specific.

In some cases the decision may be made at the subsystem level while in other cases the device driver may be left to decide. In some cases it may be desirable to leave a suspended device in that state during a system-wide power transition, but in other cases the device must be put back into the full-power state temporarily, for example so that its system wakeup capability can be disabled. This all depends on the hardware and the design of the subsystem and device driver in question.

If it is necessary to resume a device from runtime suspend during a system-wide transition into a sleep state, that can be done by calling pm_runtime_resume() from the ->suspend callback (or the ->freeze or ->poweroff callback for transitions related to hibernation) of either the device’s driver or its subsystem (for example, a bus type or a PM domain). However, subsystems must not otherwise change the runtime status of devices from their ->prepare and ->suspend callbacks (or equivalent) before invoking device drivers’ ->suspend callbacks (or equivalent).

The DPM_FLAG_SMART_SUSPEND Driver Flag

Some bus types and PM domains have a policy to resume all devices from runtime suspend upfront in their ->suspend callbacks, but that may not be really necessary if the device’s driver can cope with runtime-suspended devices. The driver can indicate this by setting DPM_FLAG_SMART_SUSPEND in power.driver_flags at probe time, with the assistance of the dev_pm_set_driver_flags() helper routine.

Setting that flag causes the PM core and middle-layer code (bus types, PM domains etc.) to skip the ->suspend_late and ->suspend_noirq callbacks provided by the driver if the device remains in runtime suspend throughout those phases of the system-wide suspend (and similarly for the “freeze” and “poweroff” parts of system hibernation). [Otherwise the same driver callback might be executed twice in a row for the same device, which would not be valid in general.] If the middle-layer system-wide PM callbacks are present for the device then they are responsible for skipping these driver callbacks; if not then the PM core skips them. The subsystem callback routines can determine whether they need to skip the driver callbacks by testing the return value from the dev_pm_skip_suspend() helper function.
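The skipping behaviour can be modelled in user-space C as follows. This is a sketch, not the kernel's implementation: TOY_FLAG_SMART_SUSPEND stands in for DPM_FLAG_SMART_SUSPEND, toy_skip_suspend() stands in for dev_pm_skip_suspend(), and the model assumes the check amounts to "flag set and device currently runtime-suspended", as the text describes.

```c
#include <stdbool.h>

#define TOY_FLAG_SMART_SUSPEND (1u << 0) /* models DPM_FLAG_SMART_SUSPEND */

enum toy_rpm_status { TOY_RPM_ACTIVE, TOY_RPM_SUSPENDED };

/* Hypothetical, cut-down model of a device's PM state. */
struct toy_device {
	unsigned int driver_flags;	/* models power.driver_flags */
	enum toy_rpm_status rpm_status;	/* runtime PM status */
	int late_calls;			/* times the driver callback ran */
};

/* Models the check a middle layer would make via dev_pm_skip_suspend(). */
static bool toy_skip_suspend(const struct toy_device *dev)
{
	return (dev->driver_flags & TOY_FLAG_SMART_SUSPEND) &&
	       dev->rpm_status == TOY_RPM_SUSPENDED;
}

/* The driver's "late" suspend callback. */
static int toy_driver_suspend_late(struct toy_device *dev)
{
	dev->late_calls++;
	return 0;
}

/*
 * A middle-layer "late" suspend callback: it is responsible for
 * skipping the driver callback when the device stays runtime-suspended.
 */
static int toy_bus_suspend_late(struct toy_device *dev)
{
	if (toy_skip_suspend(dev))
		return 0;
	return toy_driver_suspend_late(dev);
}
```

With the flag set and the device runtime-suspended, toy_bus_suspend_late() returns without touching the driver, so the driver's suspend-type code is not run a second time on a device it has already suspended.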

In addition, with DPM_FLAG_SMART_SUSPEND set, the driver’s ->thaw_noirq and ->thaw_early callbacks are skipped in hibernation if the device remained in runtime suspend throughout the preceding “freeze” transition. Again, if the middle-layer callbacks are present for the device, they are responsible for doing this, otherwise the PM core takes care of it.

The DPM_FLAG_MAY_SKIP_RESUME Driver Flag

During system-wide resume from a sleep state it’s easiest to put devices into the full-power state, as explained in Documentation/power/runtime_pm.rst. [Refer to that document for more information regarding this particular issue as well as for information on the device runtime power management framework in general.] However, it often is desirable to leave devices in suspend after system transitions to the working state, especially if those devices had been in runtime suspend before the preceding system-wide suspend (or analogous) transition.

To that end, device drivers can use the DPM_FLAG_MAY_SKIP_RESUME flag to indicate to the PM core and middle-layer code that they allow their “noirq” and “early” resume callbacks to be skipped if the device can be left in suspend after system-wide PM transitions to the working state. Whether or not that is the case generally depends on the state of the device before the given system suspend-resume cycle and on the type of the system transition under way. In particular, the “thaw” and “restore” transitions related to hibernation are not affected by DPM_FLAG_MAY_SKIP_RESUME at all. [All callbacks are issued during the “restore” transition regardless of the flag settings, and whether or not any driver callbacks are skipped during the “thaw” transition depends on whether or not the DPM_FLAG_SMART_SUSPEND flag is set (see above). In addition, a device is not allowed to remain in runtime suspend if any of its children will be returned to full power.]

The DPM_FLAG_MAY_SKIP_RESUME flag is taken into account in combination with the power.may_skip_resume status bit set by the PM core during the “suspend” phase of suspend-type transitions. If the driver or the middle layer has a reason to prevent the driver’s “noirq” and “early” resume callbacks from being skipped during the subsequent system resume transition, it should clear power.may_skip_resume in its ->suspend, ->suspend_late or ->suspend_noirq callback. [Note that the drivers setting DPM_FLAG_SMART_SUSPEND need to clear power.may_skip_resume in their ->suspend callback in case the other two are skipped.]

Setting the power.may_skip_resume status bit along with the DPM_FLAG_MAY_SKIP_RESUME flag is necessary, but generally not sufficient, for the driver’s “noirq” and “early” resume callbacks to be skipped. Whether or not they should be skipped can be determined by evaluating the dev_pm_skip_resume() helper function.

If that function returns true, the driver’s “noirq” and “early” resume callbacks should be skipped and the device’s runtime PM status will be set to “suspended” by the PM core. Otherwise, if the device was runtime-suspended during the preceding system-wide suspend transition and its DPM_FLAG_SMART_SUSPEND is set, its runtime PM status will be set to “active” by the PM core. [Hence, the drivers that do not set DPM_FLAG_SMART_SUSPEND should not expect the runtime PM status of their devices to be changed from “suspended” to “active” by the PM core during system-wide resume-type transitions.]
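The rules from the last few paragraphs can be combined into one user-space model. It is a deliberately simplified sketch and should not be read as the logic of dev_pm_skip_resume() itself: every toy_* name is hypothetical, and the child_needs_power member is an assumed way of encoding the rule that a device cannot stay suspended if any of its children returns to full power. The real PM core may take further conditions into account.

```c
#include <stdbool.h>

#define TOY_FLAG_SMART_SUSPEND   (1u << 0) /* models DPM_FLAG_SMART_SUSPEND */
#define TOY_FLAG_MAY_SKIP_RESUME (1u << 1) /* models DPM_FLAG_MAY_SKIP_RESUME */

enum toy_transition { TOY_RESUME, TOY_THAW, TOY_RESTORE };
enum toy_rpm_status { TOY_RPM_ACTIVE, TOY_RPM_SUSPENDED };

/* Hypothetical, cut-down model of a device's PM state. */
struct toy_device {
	unsigned int driver_flags;
	bool may_skip_resume;	/* models power.may_skip_resume */
	bool was_rpm_suspended;	/* runtime-suspended through the suspend */
	bool child_needs_power;	/* assumed: some child returns to full power */
	enum toy_rpm_status rpm_status;
};

/* Decide whether the "noirq" and "early" resume callbacks are skipped. */
static bool toy_skip_resume(const struct toy_device *dev,
			    enum toy_transition t)
{
	if (t == TOY_RESTORE)
		return false;	/* all callbacks run during "restore" */
	if (t == TOY_THAW)	/* only SMART_SUSPEND matters for "thaw" */
		return (dev->driver_flags & TOY_FLAG_SMART_SUSPEND) &&
		       dev->was_rpm_suspended;
	return (dev->driver_flags & TOY_FLAG_MAY_SKIP_RESUME) &&
	       dev->may_skip_resume && !dev->child_needs_power;
}

/* Update the runtime PM status the way the text says the PM core does. */
static void toy_update_rpm_status(struct toy_device *dev,
				  enum toy_transition t)
{
	if (toy_skip_resume(dev, t))
		dev->rpm_status = TOY_RPM_SUSPENDED;
	else if (dev->was_rpm_suspended &&
		 (dev->driver_flags & TOY_FLAG_SMART_SUSPEND))
		dev->rpm_status = TOY_RPM_ACTIVE;
}
```

In the model, a driver that clears may_skip_resume in its suspend callback forces the "active" outcome for a device with the smart-suspend flag set, while a device without that flag keeps whatever runtime PM status it had.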

If the DPM_FLAG_MAY_SKIP_RESUME flag is not set for a device, but DPM_FLAG_SMART_SUSPEND is set and the driver’s “late” and “noirq” suspend callbacks are skipped, its system-wide “noirq” and “early” resume callbacks, if present, are invoked as usual and the device’s runtime PM status is set to “active” by the PM core before enabling runtime PM for it. In that case, the driver must be prepared to cope with the invocation of its system-wide resume callbacks back-to-back with its ->runtime_suspend one (without the intervening ->runtime_resume and system-wide suspend callbacks) and the final state of the device must reflect the “active” runtime PM status in that case. [Note that this is not a problem at all if the driver’s ->suspend_late callback pointer points to the same function as its ->runtime_suspend one and its ->resume_early callback pointer points to the same function as the ->runtime_resume one, while none of the other system-wide suspend-resume callbacks of the driver are present, for example.]

Likewise, if DPM_FLAG_MAY_SKIP_RESUME is set for a device, its driver’s system-wide “noirq” and “early” resume callbacks may be skipped while its “late” and “noirq” suspend callbacks may have been executed (in principle, regardless of whether or not DPM_FLAG_SMART_SUSPEND is set). In that case, the driver needs to be able to cope with the invocation of its ->runtime_resume callback back-to-back with its “late” and “noirq” suspend ones. [For instance, that is not a concern if the driver sets both DPM_FLAG_SMART_SUSPEND and DPM_FLAG_MAY_SKIP_RESUME and uses the same pair of suspend/resume callback functions for runtime PM and system-wide suspend/resume.]