Energy Model of devices

1. Overview

The Energy Model (EM) framework serves as an interface between drivers knowing the power consumed by devices at various performance levels, and the kernel subsystems willing to use that information to make energy-aware decisions.

The source of the information about the power consumed by devices can vary greatly from one platform to another. These power costs can be estimated using devicetree data in some cases. In others, the firmware will know better. Alternatively, userspace might be best positioned. And so on. To spare each client subsystem from re-implementing support for every possible source of information on its own, the EM framework acts as an abstraction layer which standardizes the format of power cost tables in the kernel, thereby avoiding redundant work.

The power values might be expressed in micro-Watts or in an ‘abstract scale’. Multiple subsystems might use the EM and it is up to the system integrator to check that the requirements for the power value scale types are met. An example can be found in the Energy-Aware Scheduler documentation (Energy Aware Scheduling). For some subsystems like thermal or powercap, power values expressed in an ‘abstract scale’ might cause issues. These subsystems are more interested in estimating the power used in the past, thus the real micro-Watts might be needed. An example of these requirements can be found in the Intelligent Power Allocation documentation (Power allocator governor tunables). Kernel subsystems might implement automatic detection to check whether EM registered devices have inconsistent scale (based on EM internal flag). An important thing to keep in mind is that when the power values are expressed in an ‘abstract scale’, deriving real energy in micro-Joules is not possible.

The figure below depicts an example of drivers (Arm-specific here, but the approach is applicable to any architecture) providing power costs to the EM framework, and interested clients reading the data from it:

     +---------------+  +-----------------+  +---------------+
     | Thermal (IPA) |  | Scheduler (EAS) |  |     Other     |
     +---------------+  +-----------------+  +---------------+
             |                   | em_cpu_energy()   |
             |                   | em_cpu_get()      |
             +---------+         |         +---------+
                       |         |         |
                       v         v         v
                      +---------------------+
                      |    Energy Model     |
                      |     Framework       |
                      +---------------------+
                         ^       ^       ^
                         |       |       | em_dev_register_perf_domain()
              +----------+       |       +---------+
              |                  |                 |
      +---------------+  +---------------+  +--------------+
      |  cpufreq-dt   |  |   arm_scmi    |  |    Other     |
      +---------------+  +---------------+  +--------------+
              ^                  ^                 ^
              |                  |                 |
      +--------------+   +---------------+  +--------------+
      | Device Tree  |   |   Firmware    |  |      ?       |
      +--------------+   +---------------+  +--------------+

In case of CPU devices the EM framework manages power cost tables per ‘performance domain’ in the system. A performance domain is a group of CPUs whose performance is scaled together. Performance domains generally have a 1-to-1 mapping with CPUFreq policies. All CPUs in a performance domain are required to have the same micro-architecture. CPUs in different performance domains can have different micro-architectures.

To better reflect power variation due to static power (leakage) the EM supports runtime modifications of the power values. The mechanism relies on RCU to free the modifiable EM perf_state table memory. Its user, the task scheduler, also uses RCU to access this memory. The EM framework provides an API for allocating/freeing the new memory for the modifiable EM table. The old memory is freed automatically using the RCU callback mechanism when there are no owners anymore for the given EM runtime table instance. This is tracked using the kref mechanism. The device driver which provided the new EM at runtime should call the EM API to free it safely when it’s no longer needed. The EM framework will handle the clean-up when it’s possible.

The kernel code which wants to modify the EM values is protected from concurrent access using a mutex. Therefore, the device driver code must run in sleeping context when it tries to modify the EM.

With the runtime modifiable EM we switch from a ‘single and during the entire runtime static EM’ (system property) design to a ‘single EM which can be changed during runtime according e.g. to the workload’ (system and workload property) design.

It is also possible to modify the CPU performance values for each EM’s performance state. Thus, the full power and performance profile (which is an exponential curve) can be changed according e.g. to the workload or system property.

2. Core APIs

2.1 Config options

CONFIG_ENERGY_MODEL must be enabled to use the EM framework.

2.2 Registration of performance domains

Registration of ‘advanced’ EM

The ‘advanced’ EM gets its name from the fact that the driver is allowed to provide a more precise power model. It is not limited to a math formula implemented in the framework (as is the case for the ‘simple’ EM). It can better reflect the real power measurements performed for each performance state. Thus, this registration method should be preferred when accounting for EM static power (leakage) is important.

Drivers are expected to register performance domains into the EM framework by calling the following API:

int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
              struct em_data_callback *cb, cpumask_t *cpus, bool microwatts);

Drivers must provide a callback function returning <frequency, power> tuples for each performance state. The callback function provided by the driver is free to fetch data from any relevant location (DT, firmware, ...), and by any means deemed necessary. Only for CPU devices, drivers must specify the CPUs of the performance domains using cpumask. For devices other than CPUs the cpumask argument must be set to NULL.

The last argument ‘microwatts’ must be set to the correct value. Kernel subsystems which use EM might rely on this flag to check if all EM devices use the same scale. If there are different scales, these subsystems might decide to return warning/error, stop working or panic.

See Section 3. for an example of driver implementing this callback, or Section 2.5 for further documentation on this API.
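The interaction between the framework and the driver callback can be pictured with a small userspace sketch. It assumes the framework walks the performance states by repeatedly asking the callback for the lowest frequency strictly above the previous one, and that the callback rounds the requested frequency up ("ceils" it) to the next available operating point. All names prefixed with sketch_ and the table values are hypothetical and exist only for this illustration; this is not the kernel implementation.

```c
#include <stddef.h>

/* Hypothetical <frequency, power> tuple, as returned by the callback */
struct sketch_opp {
    unsigned long khz;
    unsigned long uw;
};

/* Made-up operating points standing in for DT or firmware data */
static const struct sketch_opp sketch_opps[] = {
    {  500000,  80000 },
    { 1000000, 200000 },
    { 1500000, 450000 },
};

/* Driver side: ceil *khz to the next operating point and report its power */
static int sketch_est_power(unsigned long *uw, unsigned long *khz)
{
    size_t i;

    for (i = 0; i < sizeof(sketch_opps) / sizeof(sketch_opps[0]); i++) {
        if (sketch_opps[i].khz >= *khz) {
            *khz = sketch_opps[i].khz;
            *uw = sketch_opps[i].uw;
            return 0;
        }
    }
    return -1; /* no operating point at or above the requested frequency */
}

typedef int (*sketch_cb_t)(unsigned long *uw, unsigned long *khz);

/* Framework side (assumed behaviour): build a table of nr_states entries */
static int sketch_build_table(sketch_cb_t cb, struct sketch_opp *out,
                              int nr_states)
{
    unsigned long freq = 0, power = 0;
    int i;

    for (i = 0; i < nr_states; i++) {
        freq++; /* ask for the lowest frequency above the previous state */
        if (cb(&power, &freq))
            return -1;
        out[i].khz = freq;
        out[i].uw = power;
    }
    return 0;
}
```

Calling sketch_build_table(sketch_est_power, out, 3) fills out[] with the three tuples in ascending frequency order. Asking for a fourth state fails, which shows why nr_states has to match what the data source can actually provide.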

Registration of EM using DT

The EM can also be registered using OPP framework and information in DT “operating-points-v2”. Each OPP entry in DT can be extended with a property “opp-microwatt” containing micro-Watts power value. This OPP DT property allows a platform to register EM power values which are reflecting total power (static + dynamic). These power values might be coming directly from experiments and measurements.

Registration of ‘artificial’ EM

There is an option to provide a custom callback for drivers missing detailed knowledge about the power value for each performance state. The callback .get_cost() is optional and provides the ‘cost’ values used by the EAS. This is useful for platforms that only provide information on relative efficiency between CPU types, where one could use the information to create an abstract power model. But even an abstract power model can sometimes be hard to fit in, given the input power value size restrictions. The .get_cost() callback allows the driver to provide ‘cost’ values which reflect the efficiency of the CPUs. This makes it possible to give EAS information which has a different relation than what would be forced by the EM internal formulas calculating ‘cost’ values. To register an EM for such a platform, the driver must set the flag ‘microwatts’ to 0, provide the .get_power() callback and provide the .get_cost() callback. The EM framework handles such a platform properly during registration. A flag EM_PERF_DOMAIN_ARTIFICIAL is set for such a platform. Special care should be taken by other frameworks which are using EM to test and treat this flag properly.

Registration of ‘simple’ EM

The ‘simple’ EM is registered using the framework helper function cpufreq_register_em_with_opp(). It implements a power model which is tied to the math formula:

Power = C * V^2 * f

The EM which is registered using this method might not reflect correctly the physics of a real device, e.g. when static power (leakage) is important.
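As a worked illustration of the formula above, the following userspace sketch computes the dynamic power for an operating point. The unit choices (coefficient C in uW/MHz/V^2, voltage in mV, frequency in kHz) and the function name simple_em_power_uw() are assumptions made for this example; it is not the framework's implementation.

```c
#include <stdint.h>

/*
 * Illustrative only: dynamic power from Power = C * V^2 * f.
 * Assumed units: coeff in uW/MHz/V^2, voltage in mV, frequency in kHz.
 * Returns the power in micro-Watts.
 */
static uint64_t simple_em_power_uw(uint64_t coeff, uint64_t mv, uint64_t khz)
{
    /* mV^2 -> V^2 divides by 10^6, kHz -> MHz divides by 10^3 */
    return coeff * mv * mv * khz / 1000000000ULL;
}
```

With a coefficient of 100, a state at 1 GHz and 1000 mV yields 100000 uW, while a state at 500 MHz and 800 mV yields 32000 uW. The power drops by more than the frequency ratio because the voltage enters squared. No static (leakage) term appears anywhere, which is exactly the limitation mentioned above.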

2.3 Accessing performance domains

There are two API functions which provide access to the energy model: em_cpu_get(), which takes a CPU id as an argument, and em_pd_get(), which takes a device pointer as an argument. It depends on the subsystem which interface it is going to use, but in case of CPU devices both functions return the same performance domain.

Subsystems interested in the energy model of a CPU can retrieve it using the em_cpu_get() API. The energy model tables are allocated once upon creation of the performance domains, and kept in memory untouched.

The energy consumed by a performance domain can be estimated using the em_cpu_energy() API. The estimation is performed assuming that the schedutil CPUfreq governor is in use in case of CPU device. Currently this calculation is not provided for other types of devices.

More details about the above APIs can be found in <linux/energy_model.h> or in Section 2.5.

2.4 Runtime modifications

Drivers willing to update the EM at runtime should use the following dedicated function to allocate a new instance of the modified EM. The API is listed below:

struct em_perf_table __rcu *em_table_alloc(struct em_perf_domain *pd);

This allocates a structure which contains the new EM table together with the RCU and kref members needed by the EM framework. The ‘struct em_perf_table’ contains the array ‘struct em_perf_state state[]’ which is a list of performance states in ascending order. That list must be populated by the device driver which wants to update the EM. The list of frequencies can be taken from the existing EM (created during boot). The content of each ‘struct em_perf_state’ must be populated by the driver as well.

This is the API which does the EM update, using RCU pointers swap:

int em_dev_update_perf_domain(struct device *dev,
                      struct em_perf_table __rcu *new_table);

Drivers must provide a pointer to the allocated and initialized new EM ‘struct em_perf_table’. That new EM will be safely used inside the EM framework and will be visible to other sub-systems in the kernel (thermal, powercap). The main design goal for this API is to be fast and avoid extra calculations or memory allocations at runtime. When pre-computed EMs are available in the device driver, then it should be possible to simply reuse them with low performance overhead.

In order to free the EM provided earlier by the driver (e.g. when the module is unloaded), there is a need to call the API:

void em_table_free(struct em_perf_table __rcu *table);

It will allow the EM framework to safely remove the memory when there is no other sub-system using it, e.g. EAS.
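The ownership rules described above can be sketched with a single-threaded userspace model. It is only an assumption-laden illustration: a plain integer stands in for kref, the table is released immediately instead of after an RCU grace period, and every name prefixed with sketch_ is hypothetical.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct em_perf_table */
struct sketch_table {
    int refcount;            /* plays the role of kref */
    int nr_states;
    unsigned long power[];   /* per-state data, simplified */
};

/* Hypothetical stand-in for struct em_perf_domain */
struct sketch_domain {
    struct sketch_table *active;
    int nr_states;
};

static int sketch_freed; /* counts released tables, for illustration only */

/* Like em_table_alloc(): the caller (driver) owns the first reference */
static struct sketch_table *sketch_table_alloc(const struct sketch_domain *pd)
{
    struct sketch_table *t;

    t = calloc(1, sizeof(*t) + (size_t)pd->nr_states * sizeof(t->power[0]));
    if (!t)
        return NULL;
    t->refcount = 1;
    t->nr_states = pd->nr_states;
    return t;
}

/* Like em_table_free(): drop one reference, release on the last one */
static void sketch_table_free(struct sketch_table *t)
{
    if (--t->refcount == 0) {
        free(t); /* the kernel defers this via an RCU callback */
        sketch_freed++;
    }
}

/* Like em_dev_update_perf_domain(): swap pointers, drop the old table */
static void sketch_update(struct sketch_domain *pd, struct sketch_table *new_table)
{
    struct sketch_table *old = pd->active;

    new_table->refcount++; /* the framework takes its own reference */
    pd->active = new_table;
    if (old)
        sketch_table_free(old);
}
```

In this model a driver that allocates a table, installs it and then calls sketch_table_free() leaves exactly one reference, held by the framework. The table is only released once a later update replaces it, which mirrors the statement that the framework handles the clean-up when possible.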

To use the power values in other sub-systems (like thermal, powercap) there is a need to call the API which protects the reader and provides consistency of the EM table data:

struct em_perf_state *em_perf_state_from_pd(struct em_perf_domain *pd);

It returns the ‘struct em_perf_state’ pointer which is an array of performance states in ascending order. This function must be called in the RCU read lock section (after the rcu_read_lock()). When the EM table is not needed anymore there is a need to call rcu_read_unlock(). In this way the EM safely uses the RCU read section and protects the users. It also allows the EM framework to manage the memory and free it. More details on how to use it can be found in Section 3.2 in the example driver.

There is a dedicated API for device drivers to calculate em_perf_state::cost values:

int em_dev_compute_costs(struct device *dev, struct em_perf_state *table,
                         int nr_states);

These ‘cost’ values from EM are used in EAS. The new EM table should be passed together with the number of entries and device pointer. When the computation of the cost values is done properly the return value from the function is 0. The function also takes care of setting the inefficiency of each performance state correctly. It updates em_perf_state::flags accordingly. Then such a prepared new EM can be passed to the em_dev_update_perf_domain() function, which will allow to use it.
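A rough userspace sketch of such a cost computation is shown below. It uses the cost formula given for struct em_perf_state in Section 2.5 (10 * power * max_frequency / frequency) and assumes that a state counts as inefficient when some higher-frequency state has an equal or lower cost. The flag value and all sketch_ names are hypothetical; the kernel function may differ in detail.

```c
#include <stdint.h>

#define SKETCH_STATE_INEFFICIENT (1UL << 0) /* hypothetical flag value */

/* Hypothetical reduced version of struct em_perf_state */
struct sketch_perf_state {
    unsigned long frequency; /* kHz */
    unsigned long power;     /* micro-Watts or abstract scale */
    unsigned long cost;
    unsigned long flags;
};

/*
 * Fill in 'cost' for every state of a table sorted by ascending frequency
 * and flag the states that are no cheaper than a faster state.
 * Returns 0 on success, -1 on invalid input.
 */
static int sketch_compute_costs(struct sketch_perf_state *table, int nr_states)
{
    uint64_t prev_cost = UINT64_MAX;
    unsigned long fmax;
    int i;

    if (!table || nr_states <= 0)
        return -1;

    fmax = table[nr_states - 1].frequency;

    /* Walk downwards so prev_cost is the cheapest cost among faster states */
    for (i = nr_states - 1; i >= 0; i--) {
        uint64_t cost;

        if (!table[i].frequency)
            return -1;

        cost = (uint64_t)10 * table[i].power * fmax / table[i].frequency;
        table[i].cost = (unsigned long)cost;

        if (cost >= prev_cost)
            table[i].flags |= SKETCH_STATE_INEFFICIENT;
        else
            prev_cost = cost;
    }

    return 0;
}
```

For a table with <500000 kHz, 100>, <1000000 kHz, 150> and <2000000 kHz, 800> the costs come out as 4000, 3000 and 8000. The lowest state is flagged as inefficient because the middle state delivers more performance at a lower cost.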

More details about the above APIs can be found in <linux/energy_model.h> or in Section 3.2 with an example code showing a simple implementation of the updating mechanism in a device driver.

2.5 Description details of this API

struct em_perf_state

Performance state of a performance domain

Definition:

struct em_perf_state {
    unsigned long performance;
    unsigned long frequency;
    unsigned long power;
    unsigned long cost;
    unsigned long flags;
};

Members

performance

CPU performance (capacity) at a given frequency

frequency

The frequency in KHz, for consistency with CPUFreq

power

The power consumed at this level (by 1 CPU or by a registered device). It can be a total power: static and dynamic.

cost

The cost coefficient associated with this level, used during energy calculation. Equal to: 10 * power * max_frequency / frequency

flags

see “em_perf_state flags” description below.

struct em_perf_table

Performance states table

Definition:

struct em_perf_table {
    struct rcu_head rcu;
    struct kref kref;
    struct em_perf_state state[];
};

Members

rcu

RCU used for safe access and destruction

kref

Reference counter to track the users

state

List of performance states, in ascending order

struct em_perf_domain

Performance domain

Definition:

struct em_perf_domain {
    struct em_perf_table  *em_table;
    struct list_head node;
    int id;
    int nr_perf_states;
    int min_perf_state;
    int max_perf_state;
    unsigned long flags;
    unsigned long cpus[];
};

Members

em_table

Pointer to the runtime modifiable em_perf_table

node

node in em_pd_list (in energy_model.c)

id

A unique ID number for each performance domain

nr_perf_states

Number of performance states

min_perf_state

Minimum allowed Performance State index

max_perf_state

Maximum allowed Performance State index

flags

See “em_perf_domain flags”

cpus

Cpumask covering the CPUs of the domain. It’s here for performance reasons to avoid potential cache misses during energy calculations in the scheduler and simplifies allocating/freeing that memory region.

Description

In case of CPU device, a “performance domain” represents a group of CPUs whose performance is scaled together. All CPUs of a performance domain must have the same micro-architecture. Performance domains often have a 1-to-1 mapping with CPUFreq policies. In case of other devices the cpus field is unused.

int em_pd_get_efficient_state(struct em_perf_state *table, struct em_perf_domain *pd, unsigned long max_util)

Get an efficient performance state from the EM

Parameters

struct em_perf_state *table

List of performance states, in ascending order

struct em_perf_domain *pd

performance domain for which this must be done

unsigned long max_util

Max utilization to map with the EM

Description

It is called from the scheduler code quite frequently and as a consequence doesn’t implement any check.

Return

An efficient performance state id, high enough to meet max_util requirement.
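The lookup can be sketched in userspace as follows. The sketch assumes that the search is limited to the range between min_perf_state and max_perf_state, that inefficient states are skipped only when a per-domain flag asks for it, and that the highest allowed state is the fallback. Both flag values and all sketch_ names are hypothetical.

```c
#define SKETCH_STATE_INEFFICIENT   (1UL << 0) /* hypothetical state flag */
#define SKETCH_PD_SKIP_INEFFICIENT (1UL << 0) /* hypothetical domain flag */

/* Hypothetical reduced versions of the EM structures */
struct sketch_perf_state {
    unsigned long performance;
    unsigned long flags;
};

struct sketch_perf_domain {
    int min_perf_state;
    int max_perf_state;
    unsigned long flags;
};

/* Return the index of the lowest suitable state that satisfies max_util */
static int sketch_get_efficient_state(const struct sketch_perf_state *table,
                                      const struct sketch_perf_domain *pd,
                                      unsigned long max_util)
{
    int i;

    for (i = pd->min_perf_state; i <= pd->max_perf_state; i++) {
        const struct sketch_perf_state *ps = &table[i];

        if (ps->performance < max_util)
            continue; /* not enough capacity at this state */

        if ((pd->flags & SKETCH_PD_SKIP_INEFFICIENT) &&
            (ps->flags & SKETCH_STATE_INEFFICIENT))
            continue; /* a faster state is at least as cheap */

        return i;
    }

    /* Nothing fits: fall back to the highest allowed state */
    return pd->max_perf_state;
}
```

As in the real function, no argument validation is performed, so the caller has to pass a valid table and consistent limits.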

unsigned long em_cpu_energy(struct em_perf_domain *pd, unsigned long max_util, unsigned long sum_util, unsigned long allowed_cpu_cap)

Estimates the energy consumed by the CPUs of a performance domain

Parameters

struct em_perf_domain *pd

performance domain for which energy has to be estimated

unsigned long max_util

highest utilization among CPUs of the domain

unsigned long sum_util

sum of the utilization of all CPUs in the domain

unsigned long allowed_cpu_cap

maximum allowed CPU capacity for the pd, which might reflect reduced frequency (due to thermal)

Description

This function must be used only for CPU devices. There is no validation, i.e. whether the EM is a CPU type and has a cpumask allocated. It is called from the scheduler code quite frequently, which is why there are no checks.

Return

the sum of the energy consumed by the CPUs of the domain assuming a capacity state satisfying the max utilization of the domain.
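The idea behind the estimate can be illustrated with a simplified userspace model. It assumes that every CPU of the domain runs at the lowest performance state whose capacity covers the (capped) maximum utilization, and that the energy of each CPU is the power of that state scaled by the fraction of time the CPU is busy. The sketch omits the schedutil headroom and the precomputed ‘cost’ coefficient used by the kernel, and all sketch_ names are hypothetical.

```c
/* Hypothetical reduced version of struct em_perf_state */
struct sketch_perf_state {
    unsigned long performance; /* capacity at this state */
    unsigned long power;       /* power of one CPU at this state */
};

/*
 * Simplified model: energy = power * sum_util / performance of the
 * selected state. The result is in the same scale as 'power'.
 */
static unsigned long sketch_cpu_energy(const struct sketch_perf_state *table,
                                       int nr_states,
                                       unsigned long max_util,
                                       unsigned long sum_util,
                                       unsigned long allowed_cpu_cap)
{
    int i;

    if (nr_states <= 0)
        return 0;

    /* A capped domain cannot be asked for more than the allowed capacity */
    if (max_util > allowed_cpu_cap)
        max_util = allowed_cpu_cap;

    /* Lowest state that satisfies max_util, else the highest state */
    for (i = 0; i < nr_states - 1; i++) {
        if (table[i].performance >= max_util)
            break;
    }

    return table[i].power * sum_util / table[i].performance;
}
```

With states <256, 100000>, <512, 300000> and <1024, 1000000>, a domain with max_util = 400 and sum_util = 512 is estimated at 300000. If allowed_cpu_cap is reduced to 256, the lowest state is selected instead and the estimate becomes 200000.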

int em_pd_nr_perf_states(struct em_perf_domain *pd)

Get the number of performance states of a perf. domain

Parameters

struct em_perf_domain *pd

performance domain for which this must be done

Return

the number of performance states in the performance domain table

struct em_perf_state *em_perf_state_from_pd(struct em_perf_domain *pd)

Get the performance states table of perf. domain

Parameters

struct em_perf_domain *pd

performance domain for which this must be done

Description

To use this function the rcu_read_lock() must be held. After the usage of the performance states table is finished, rcu_read_unlock() should be called.

Return

the pointer to performance states table of the performance domain

int em_dev_update_perf_domain(struct device *dev, struct em_perf_table *new_table)

Update runtime EM table for a device

Parameters

struct device *dev

Device for which the EM is to be updated

struct em_perf_table *new_table

The new EM table that is going to be used from now

Description

Update EM runtime modifiable table for the dev using the provided table.

This function uses a mutex to serialize writers, so it must not be called from a non-sleeping context.

Return 0 on success or an error code on failure.

struct em_perf_domain *em_pd_get(struct device *dev)

Return the performance domain for a device

Parameters

struct device *dev

Device to find the performance domain for

Description

Returns the performance domain to which dev belongs, or NULL if it doesn’t exist.

struct em_perf_domain *em_cpu_get(int cpu)

Return the performance domain for a CPU

Parameters

int cpu

CPU to find the performance domain for

Description

Returns the performance domain to which cpu belongs, or NULL if it doesn’t exist.

int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states, const struct em_data_callback *cb, const cpumask_t *cpus, bool microwatts)

Register the Energy Model (EM) for a device

Parameters

struct device *dev

Device for which the EM is to be registered

unsigned int nr_states

Number of performance states to register

const struct em_data_callback *cb

Callback functions providing the data of the Energy Model

const cpumask_t *cpus

Pointer to cpumask_t, which in case of a CPU device is obligatory. It can be taken from e.g. ‘policy->cpus’. For other types of devices this should be set to NULL.

bool microwatts

Flag indicating whether the power values are in micro-Watts or in some other scale. It must be set properly.

Description

Create Energy Model tables for a performance domain using the callbacks defined in cb.

The microwatts flag must be set to the correct value. Some kernel sub-systems might rely on this flag and check if all devices in the EM are using the same scale.

If multiple clients register the same performance domain, all but the first registration will be ignored.

Return 0 on success

int em_dev_register_pd_no_update(struct device *dev, unsigned int nr_states, const struct em_data_callback *cb, const cpumask_t *cpus, bool microwatts)

Register a perf domain for a device

Parameters

struct device *dev

Device to register the PD for

unsigned int nr_states

Number of performance states in the new PD

const struct em_data_callback *cb

Callback functions for populating the energy model

const cpumask_t *cpus

CPUs to include in the new PD (mandatory if dev is a CPU device)

bool microwatts

Whether or not the power values in the EM will be in uW

Description

Like em_dev_register_perf_domain(), but does not trigger a CPU capacity update after registering the PD, even if dev is a CPU device.

void em_dev_unregister_perf_domain(struct device *dev)

Unregister Energy Model (EM) for a device

Parameters

struct device *dev

Device for which the EM is registered

Description

Unregister the EM for the specified dev (but not a CPU device).

int em_dev_update_chip_binning(struct device *dev)

Update Energy Model after the new voltage information is present in the OPPs.

Parameters

struct device *dev

Device for which the Energy Model has to be updated.

Description

This function allows to update easily the EM with new values available in the OPP framework and DT. It can be used after the chip has been properly verified by device drivers and the voltages adjusted for the ‘chip binning’.

int em_update_performance_limits(struct em_perf_domain *pd, unsigned long freq_min_khz, unsigned long freq_max_khz)

Update Energy Model with performance limits information.

Parameters

struct em_perf_domain *pd

Performance Domain with EM that has to be updated.

unsigned long freq_min_khz

New minimum allowed frequency for this device.

unsigned long freq_max_khz

New maximum allowed frequency for this device.

Description

This function allows to update the EM with information about available performance levels. It takes the minimum and maximum frequency in kHz and does internal translation to performance levels. Returns 0 on success or -EINVAL when failed.
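The translation from frequencies to performance levels can be sketched in userspace as follows. The sketch assumes that both frequencies must match entries of the performance state table exactly and that the limits are left untouched on failure. The extra check that the minimum does not exceed the maximum, as well as all sketch_ names, are assumptions of this example.

```c
#include <errno.h>

/* Hypothetical reduced versions of the EM structures */
struct sketch_perf_state {
    unsigned long frequency; /* kHz */
};

struct sketch_perf_domain {
    int nr_perf_states;
    int min_perf_state;
    int max_perf_state;
};

/* Map frequency limits in kHz to performance state indexes */
static int sketch_update_limits(struct sketch_perf_domain *pd,
                                const struct sketch_perf_state *table,
                                unsigned long freq_min_khz,
                                unsigned long freq_max_khz)
{
    int min_ps = -1, max_ps = -1;
    int i;

    for (i = 0; i < pd->nr_perf_states; i++) {
        if (table[i].frequency == freq_min_khz)
            min_ps = i;
        if (table[i].frequency == freq_max_khz)
            max_ps = i;
    }

    /* Only update the domain when both limits were found */
    if (min_ps < 0 || max_ps < 0 || min_ps > max_ps)
        return -EINVAL;

    pd->min_perf_state = min_ps;
    pd->max_perf_state = max_ps;

    return 0;
}
```

The resulting min_perf_state and max_perf_state indexes are the ones that bound the search when an efficient performance state is looked up.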

3. Examples

3.1 Example driver with EM registration

The CPUFreq framework supports a dedicated callback for registering the EM for a given CPU(s) ‘policy’ object: cpufreq_driver::register_em(). That callback has to be implemented properly for a given driver, because the framework would call it at the right time during setup. This section provides a simple example of a CPUFreq driver registering a performance domain in the Energy Model framework using the (fake) ‘foo’ protocol. The driver implements an est_power() function to be provided to the EM framework:

-> drivers/cpufreq/foo_cpufreq.c

    static int est_power(struct device *dev, unsigned long *uW,
                    unsigned long *KHz)
    {
            long freq, power;

            /* Use the 'foo' protocol to ceil the frequency */
            freq = foo_get_freq_ceil(dev, *KHz);
            if (freq < 0)
                    return freq;

            /* Estimate the power cost for the dev at the relevant freq. */
            power = foo_estimate_power(dev, freq);
            if (power < 0)
                    return power;

            /*
             * Return the values to the EM framework. The power is reported
             * in micro-Watts because 'microwatts' is set to true below.
             */
            *uW = power;
            *KHz = freq;

            return 0;
    }

    static void foo_cpufreq_register_em(struct cpufreq_policy *policy)
    {
            struct em_data_callback em_cb = EM_DATA_CB(est_power);
            struct device *cpu_dev;
            int nr_opp;

            cpu_dev = get_cpu_device(cpumask_first(policy->cpus));

            /* Find the number of OPPs for this policy */
            nr_opp = foo_get_nr_opp(policy);

            /* And register the new performance domain */
            em_dev_register_perf_domain(cpu_dev, nr_opp, &em_cb, policy->cpus,
                                        true);
    }

    static struct cpufreq_driver foo_cpufreq_driver = {
            .register_em = foo_cpufreq_register_em,
    };

3.2 Example driver with EM modification

This section provides a simple example of a thermal driver modifying the EM. The driver implements a foo_thermal_em_update() function. The driver is woken up periodically to check the temperature and modify the EM data:

-> drivers/soc/example/example_em_mod.c

    static void foo_get_new_em(struct foo_context *ctx)
    {
            struct em_perf_table __rcu *em_table;
            struct em_perf_state *table, *new_table;
            struct device *dev = ctx->dev;
            struct em_perf_domain *pd;
            unsigned long freq;
            int i, ret;

            pd = em_pd_get(dev);
            if (!pd)
                    return;

            em_table = em_table_alloc(pd);
            if (!em_table)
                    return;

            new_table = em_table->state;

            /* Read the frequencies of the current EM under RCU protection */
            rcu_read_lock();
            table = em_perf_state_from_pd(pd);
            for (i = 0; i < pd->nr_perf_states; i++) {
                    freq = table[i].frequency;
                    foo_get_power_perf_values(dev, freq, &new_table[i]);
            }
            rcu_read_unlock();

            /* Calculate 'cost' values for EAS */
            ret = em_dev_compute_costs(dev, new_table, pd->nr_perf_states);
            if (ret) {
                    dev_warn(dev, "EM: compute costs failed %d\n", ret);
                    em_table_free(em_table);
                    return;
            }

            ret = em_dev_update_perf_domain(dev, em_table);
            if (ret) {
                    dev_warn(dev, "EM: update failed %d\n", ret);
                    em_table_free(em_table);
                    return;
            }

            /*
             * Since it's one-time-update drop the usage counter.
             * The EM framework will later free the table when needed.
             */
            em_table_free(em_table);
    }

    /*
     * Function called periodically to check the temperature and
     * update the EM if needed
     */
    static void foo_thermal_em_update(struct foo_context *ctx)
    {
            struct device *dev = ctx->dev;

            ctx->temperature = foo_get_temp(dev, ctx);
            if (ctx->temperature < FOO_EM_UPDATE_TEMP_THRESHOLD)
                    return;

            foo_get_new_em(ctx);
    }