Energy Model of devices
1. Overview
The Energy Model (EM) framework serves as an interface between drivers that know the power consumed by devices at various performance levels and the kernel subsystems that want to use that information to make energy-aware decisions.
The source of the information about the power consumed by devices can vary greatly from one platform to another. These power costs can be estimated using devicetree data in some cases. In others, the firmware will know better. Alternatively, userspace might be best positioned. And so on. To spare each client subsystem from re-implementing support for every possible source of information, the EM framework acts as an abstraction layer which standardizes the format of power cost tables in the kernel, avoiding redundant work.
The power values might be expressed in micro-Watts or in an ‘abstract scale’. Multiple subsystems might use the EM and it is up to the system integrator to check that the requirements for the power value scale types are met. An example can be found in the Energy-Aware Scheduler documentation, Energy Aware Scheduling. For some subsystems, like thermal or powercap, power values expressed in an ‘abstract scale’ might cause issues. These subsystems are more interested in an estimation of the power used in the past, thus real micro-Watts might be needed. An example of these requirements can be found in the Intelligent Power Allocation documentation, in Power allocator governor tunables. Kernel subsystems might implement automatic detection to check whether EM registered devices have an inconsistent scale (based on an EM internal flag). An important thing to keep in mind is that when the power values are expressed in an ‘abstract scale’, deriving real energy in micro-Joules is not possible.
The figure below depicts an example of drivers (Arm-specific here, but the approach is applicable to any architecture) providing power costs to the EM framework, and interested clients reading the data from it:
     +---------------+  +-----------------+  +---------------+
     | Thermal (IPA) |  | Scheduler (EAS) |  |     Other     |
     +---------------+  +-----------------+  +---------------+
             |                   | em_cpu_energy()   |
             |                   | em_cpu_get()      |
             +---------+         |         +---------+
                       |         |         |
                       v         v         v
                      +---------------------+
                      |    Energy Model     |
                      |     Framework       |
                      +---------------------+
                         ^       ^       ^
                         |       |       | em_dev_register_perf_domain()
              +----------+       |       +---------+
              |                  |                 |
      +---------------+  +---------------+  +--------------+
      |  cpufreq-dt   |  |   arm_scmi    |  |    Other     |
      +---------------+  +---------------+  +--------------+
              ^                  ^                 ^
              |                  |                 |
      +--------------+   +---------------+  +--------------+
      | Device Tree  |   |   Firmware    |  |      ?       |
      +--------------+   +---------------+  +--------------+
In case of CPU devices, the EM framework manages power cost tables per ‘performance domain’ in the system. A performance domain is a group of CPUs whose performance is scaled together. Performance domains generally have a 1-to-1 mapping with CPUFreq policies. All CPUs in a performance domain are required to have the same micro-architecture. CPUs in different performance domains can have different micro-architectures.
To better reflect power variation due to static power (leakage), the EM supports runtime modifications of the power values. The mechanism relies on RCU to free the modifiable EM perf_state table memory. Its user, the task scheduler, also uses RCU to access this memory. The EM framework provides an API for allocating and freeing the new memory for the modifiable EM table. The old memory is freed automatically using the RCU callback mechanism when there are no owners left for the given EM runtime table instance. This is tracked using the kref mechanism. The device driver which provided the new EM at runtime should call the EM API to free it safely when it is no longer needed. The EM framework will handle the clean-up when possible.
Kernel code which wants to modify the EM values is protected from concurrent access using a mutex. Therefore, the device driver code must run in a sleeping context (one that is allowed to sleep) when it tries to modify the EM.
With the runtime modifiable EM, we switch from a ‘single and during the entire runtime static EM’ (system property) design to a ‘single EM which can be changed during runtime according e.g. to the workload’ (system and workload property) design.
It is also possible to modify the CPU performance values for each of the EM’s performance states. Thus, the full power and performance profile (which is an exponential curve) can be changed according e.g. to the workload or system property.
2. Core APIs
2.1 Config options
CONFIG_ENERGY_MODEL must be enabled to use the EM framework.
2.2 Registration of performance domains
Registration of ‘advanced’ EM
The ‘advanced’ EM gets its name from the fact that the driver is allowed to provide a more precise power model. It is not limited to a math formula implemented in the framework (as in the ‘simple’ EM case). It can better reflect the real power measurements performed for each performance state. Thus, this registration method should be preferred when accounting for EM static power (leakage) is important.
Drivers are expected to register performance domains into the EM framework by calling the following API:
int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states, const struct em_data_callback *cb, const cpumask_t *cpus, bool microwatts);
Drivers must provide a callback function returning <frequency, power> tuples for each performance state. The callback function provided by the driver is free to fetch data from any relevant location (DT, firmware, ...), and by any means deemed necessary. Only for CPU devices, drivers must specify the CPUs of the performance domains using a cpumask. For devices other than CPUs, the ‘cpus’ argument must be set to NULL.
The last argument, ‘microwatts’, must be set to the correct value. Kernel subsystems which use the EM might rely on this flag to check if all EM devices use the same scale. If there are different scales, these subsystems might decide to return a warning/error, stop working or panic.
See Section 3. for an example of a driver implementing this callback, or Section 2.5 for further documentation on this API.
Registration of EM using DT
The EM can also be registered using the OPP framework and the information in DT “operating-points-v2”. Each OPP entry in DT can be extended with an “opp-microwatt” property containing a micro-Watts power value. This OPP DT property allows a platform to register EM power values which reflect total power (static + dynamic). These power values might come directly from experiments and measurements.
Registration of ‘artificial’ EM
There is an option to provide a custom callback for drivers missing detailed knowledge about the power value of each performance state. The callback .get_cost() is optional and provides the ‘cost’ values used by the EAS. This is useful for platforms that only provide information on relative efficiency between CPU types, where one could use the information to create an abstract power model. But even an abstract power model can sometimes be hard to fit in, given the input power value size restrictions. The .get_cost() callback allows providing ‘cost’ values which reflect the efficiency of the CPUs. This makes it possible to give EAS information which has a different relation than what would be forced by the EM internal formulas calculating ‘cost’ values. To register an EM for such a platform, the driver must set the ‘microwatts’ flag to 0, provide the .get_power() callback and provide the .get_cost() callback. The EM framework handles such a platform properly during registration. The EM_PERF_DOMAIN_ARTIFICIAL flag is set for such a platform. Special care should be taken by other frameworks which use the EM to test and treat this flag properly.
Registration of ‘simple’ EM
The ‘simple’ EM is registered using the framework helper function cpufreq_register_em_with_opp(). It implements a power model which is tied to the math formula:
Power = C * V^2 * f
The EM which is registered using this method might not correctly reflect the physics of a real device, e.g. when static power (leakage) is important.
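As a rough illustration of the formula above, the following C sketch evaluates Power = C * V^2 * f for a single operating point. The helper name simple_em_power_uw() and the unit choices (coefficient in uW/MHz/V^2, voltage in mV, frequency in MHz) are assumptions made for this example; it is not the code used by cpufreq_register_em_with_opp().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel API): dynamic power of one operating
 * point, P = C * V^2 * f.
 *
 * coeff: power coefficient 'C', assumed here to be in uW/MHz/V^2
 * mv:    voltage in millivolts
 * mhz:   frequency in MHz
 *
 * Returns the estimated power in micro-Watts.
 */
uint64_t simple_em_power_uw(uint64_t coeff, uint64_t mv, uint64_t mhz)
{
    /* Keep the 64-bit product far from overflow for realistic inputs. */
    assert(coeff <= 100000 && mv <= 5000 && mhz <= 100000);

    /* mV^2 is 1e6 times V^2, hence the division by 1000000. */
    return coeff * mv * mv * mhz / 1000000ULL;
}
```

With a coefficient of 100, an operating point at 1000 mV and 1000 MHz gives 100000 uW, while 800 mV and 500 MHz gives 32000 uW: halving the frequency and lowering the voltage by 20% cuts the modelled power to about a third. Static power does not appear anywhere in this model, which is the limitation described above.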
2.3 Accessing performance domains
There are two API functions which provide access to the energy model: em_cpu_get(), which takes a CPU id as an argument, and em_pd_get(), which takes a device pointer as an argument. Which interface to use depends on the subsystem, but in case of CPU devices both functions return the same performance domain.
Subsystems interested in the energy model of a CPU can retrieve it using the em_cpu_get() API. The energy model tables are allocated once upon creation of the performance domains, and kept in memory untouched.
The energy consumed by a performance domain can be estimated using the em_cpu_energy() API. In case of a CPU device, the estimation is performed assuming that the schedutil CPUfreq governor is in use. Currently this calculation is not provided for other types of devices.
More details about the above APIs can be found in <linux/energy_model.h> or in Section 2.5.
2.4 Runtime modifications
Drivers willing to update the EM at runtime should use the following dedicated function to allocate a new instance of the modified EM. The API is listed below:
struct em_perf_table __rcu *em_table_alloc(struct em_perf_domain *pd);
This allocates a structure which contains the new EM table together with the RCU and kref data needed by the EM framework. The ‘struct em_perf_table’ contains the array ‘struct em_perf_state state[]’, which is a list of performance states in ascending order. That list must be populated by the device driver which wants to update the EM. The list of frequencies can be taken from the existing EM (created during boot). The content of each ‘struct em_perf_state’ must be populated by the driver as well.
This is the API which does the EM update, using an RCU pointer swap:
int em_dev_update_perf_domain(struct device *dev, struct em_perf_table __rcu *new_table);
Drivers must provide a pointer to the allocated and initialized new EM ‘struct em_perf_table’. That new EM will be safely used inside the EM framework and will be visible to other sub-systems in the kernel (thermal, powercap). The main design goal for this API is to be fast and avoid extra calculations or memory allocations at runtime. When pre-computed EMs are available in the device driver, it should be possible to simply reuse them with low performance overhead.
In order to free the EM provided earlier by the driver (e.g. when the module is unloaded), the following API must be called:
void em_table_free(struct em_perf_table __rcu *table);
It allows the EM framework to safely free the memory when no other sub-system, e.g. EAS, is using it.
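The ownership rules described above can be pictured with a small userspace model. This is only a sketch: the ref_* names are invented for the example, a plain counter stands in for the kref, and the table is freed immediately instead of through an RCU callback. It assumes, as the text suggests, that the framework holds its own reference to the active table in addition to the one owned by the driver.

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace stand-ins for this sketch only; none of these are kernel APIs. */
struct ref_table {
    unsigned int refs;          /* plays the role of the kref */
};

struct ref_domain {
    struct ref_table *active;   /* table currently visible to readers */
};

unsigned int ref_live_tables;   /* tables allocated and not yet freed */

/* Models em_table_alloc(): the caller (driver) owns the first reference. */
struct ref_table *ref_table_alloc(void)
{
    struct ref_table *t = malloc(sizeof(*t));

    if (!t)
        return NULL;
    t->refs = 1;
    ref_live_tables++;
    return t;
}

/* Models em_table_free(): drop one reference, free on the last one. */
void ref_table_put(struct ref_table *t)
{
    assert(t && t->refs > 0);
    if (--t->refs == 0) {
        ref_live_tables--;
        free(t);
    }
}

/* Models the update step: the framework takes its own reference. */
void ref_domain_update(struct ref_domain *pd, struct ref_table *new_table)
{
    struct ref_table *old = pd->active;

    assert(pd && new_table);
    new_table->refs++;
    pd->active = new_table;
    if (old)
        ref_table_put(old);     /* framework drops its hold on the old table */
}
```

In this model a driver that allocates a table, installs it and then drops its own reference leaves the table alive for as long as it is the active one; the table only disappears once a later update replaces it.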
To use the power values in other sub-systems (like thermal, powercap), call the API which protects the reader and provides consistency of the EM table data:
struct em_perf_state *em_perf_state_from_pd(struct em_perf_domain *pd);
It returns the ‘struct em_perf_state’ pointer, which is an array of performance states in ascending order. This function must be called inside an RCU read-side critical section (after rcu_read_lock()). When the EM table is not needed anymore, rcu_read_unlock() must be called. In this way the EM safely uses the RCU read section and protects the users. It also allows the EM framework to manage the memory and free it. More details on how to use it can be found in Section 3.2, in the example driver.
There is a dedicated API for device drivers to calculate em_perf_state::cost values:
int em_dev_compute_costs(struct device *dev, struct em_perf_state *table, int nr_states);
These ‘cost’ values from the EM are used in EAS. The new EM table should be passed together with the number of entries and the device pointer. When the cost values are computed properly, the function returns 0. The function also takes care of setting the inefficiency of each performance state correctly, updating em_perf_state::flags accordingly. The new EM prepared this way can then be passed to the em_dev_update_perf_domain() function, which will allow it to be used.
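To make the ‘cost’ and inefficiency handling more concrete, here is a userspace sketch. It uses the cost formula given in Section 2.5 (10 * power * max_frequency / frequency). The rule used for marking inefficient states is an assumption of this example: a state is flagged when some higher-frequency state has an equal or lower cost. The cost_* names and the flag value are hypothetical and are not kernel identifiers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define COST_STATE_INEFFICIENT (1UL << 0)   /* hypothetical flag for this sketch */

struct cost_state {
    uint64_t frequency;     /* kHz */
    uint64_t power;         /* uW, or abstract scale */
    uint64_t cost;
    unsigned long flags;
};

/* Fill in cost and the inefficiency flag for a table in ascending order. */
int cost_compute(struct cost_state *table, size_t nr_states)
{
    uint64_t max_freq, prev_cost = UINT64_MAX;
    size_t i;

    if (!table || nr_states == 0)
        return -EINVAL;

    max_freq = table[nr_states - 1].frequency;

    /* Walk from the highest state down, tracking the lowest cost seen. */
    for (i = nr_states; i-- > 0; ) {
        struct cost_state *ps = &table[i];

        if (ps->frequency == 0)
            return -EINVAL;
        /* The table is expected to be sorted by ascending frequency. */
        assert(i == 0 || table[i - 1].frequency < ps->frequency);

        ps->cost = 10 * ps->power * max_freq / ps->frequency;

        if (ps->cost >= prev_cost) {
            /* A faster state is at least as cheap per unit of work. */
            ps->flags |= COST_STATE_INEFFICIENT;
        } else {
            ps->flags &= ~COST_STATE_INEFFICIENT;
            prev_cost = ps->cost;
        }
    }
    return 0;
}
```

For a table with frequencies 500000, 1000000, 1500000 and 2000000 kHz and powers 100000, 150000, 450000 and 800000 uW, the sketch yields costs 4000000, 3000000, 6000000 and 8000000. Only the lowest state is flagged, because the 1000000 kHz state has a lower cost.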
More details about the above APIs can be found in <linux/energy_model.h> or in Section 3.2, with example code showing a simple implementation of the updating mechanism in a device driver.
2.5 Description details of this API
- struct em_perf_state
Performance state of a performance domain
Definition:
    struct em_perf_state {
        unsigned long performance;
        unsigned long frequency;
        unsigned long power;
        unsigned long cost;
        unsigned long flags;
    };
Members
performance
    CPU performance (capacity) at a given frequency
frequency
    The frequency in KHz, for consistency with CPUFreq
power
    The power consumed at this level (by 1 CPU or by a registered device). It can be a total power: static and dynamic.
cost
    The cost coefficient associated with this level, used during energy calculation. Equal to: 10 * power * max_frequency / frequency
flags
    see “em_perf_state flags” description below.
- struct em_perf_table
Performance states table
Definition:
    struct em_perf_table {
        struct rcu_head rcu;
        struct kref kref;
        struct em_perf_state state[];
    };
Members
rcu
    RCU used for safe access and destruction
kref
    Reference counter to track the users
state
    List of performance states, in ascending order
- struct em_perf_domain
Performance domain
Definition:
    struct em_perf_domain {
        struct em_perf_table *em_table;
        struct list_head node;
        int id;
        int nr_perf_states;
        int min_perf_state;
        int max_perf_state;
        unsigned long flags;
        unsigned long cpus[];
    };
Members
em_table
    Pointer to the runtime modifiable em_perf_table
node
    node in em_pd_list (in energy_model.c)
id
    A unique ID number for each performance domain
nr_perf_states
    Number of performance states
min_perf_state
    Minimum allowed Performance State index
max_perf_state
    Maximum allowed Performance State index
flags
    See “em_perf_domain flags”
cpus
    Cpumask covering the CPUs of the domain. It’s here for performance reasons to avoid potential cache misses during energy calculations in the scheduler and simplifies allocating/freeing that memory region.
Description
In case of CPU device, a “performance domain” represents a group of CPUs whose performance is scaled together. All CPUs of a performance domain must have the same micro-architecture. Performance domains often have a 1-to-1 mapping with CPUFreq policies. In case of other devices the cpus field is unused.
- int em_pd_get_efficient_state(struct em_perf_state *table, struct em_perf_domain *pd, unsigned long max_util)
Get an efficient performance state from the EM
Parameters
struct em_perf_state *table
    List of performance states, in ascending order
struct em_perf_domain *pd
    performance domain for which this must be done
unsigned long max_util
    Max utilization to map with the EM
Description
It is called from the scheduler code quite frequently and as a consequence doesn’t implement any check.
Return
An efficient performance state id, high enough to meet max_util requirement.
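One way to picture such a lookup is sketched below in userspace C. It is not the kernel implementation: the eff_* names, the flag value and the skip_inefficient parameter are hypothetical. It assumes that the search covers the allowed index range in ascending order, returns the first state whose performance covers max_util, and falls back to the highest allowed state when none does. The assert exists only for the sketch; the real function performs no checks.

```c
#include <assert.h>
#include <stdbool.h>

#define EFF_STATE_INEFFICIENT (1UL << 0)    /* hypothetical flag for this sketch */

struct eff_state {
    unsigned long performance;  /* capacity reached at this state */
    unsigned long flags;
};

/* Return the index of the lowest suitable state in [min_ps, max_ps]. */
int eff_get_state(const struct eff_state *table, int min_ps, int max_ps,
                  unsigned long max_util, bool skip_inefficient)
{
    int i;

    assert(table && min_ps >= 0 && min_ps <= max_ps);

    for (i = min_ps; i <= max_ps; i++) {
        if (table[i].performance < max_util)
            continue;   /* too slow for the requested utilization */
        if (skip_inefficient && (table[i].flags & EFF_STATE_INEFFICIENT))
            continue;   /* a faster state is at least as cheap */
        return i;
    }

    /* Nothing fits: fall back to the highest allowed state. */
    return max_ps;
}
```

With performance values 256, 512, 768 and 1024, and the second state flagged inefficient, a max_util of 300 maps to index 2 when inefficient states are skipped and to index 1 when they are not.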
- unsigned long em_cpu_energy(struct em_perf_domain *pd, unsigned long max_util, unsigned long sum_util, unsigned long allowed_cpu_cap)
Estimates the energy consumed by the CPUs of a performance domain
Parameters
struct em_perf_domain *pd
    performance domain for which energy has to be estimated
unsigned long max_util
    highest utilization among CPUs of the domain
unsigned long sum_util
    sum of the utilization of all CPUs in the domain
unsigned long allowed_cpu_cap
    maximum allowed CPU capacity for the pd, which might reflect reduced frequency (due to thermal)
Description
This function must be used only for CPU devices. There is no validation, e.g. of whether the EM is a CPU type and has a cpumask allocated. It is called from the scheduler code quite frequently, which is why there are no checks.
Return
the sum of the energy consumed by the CPUs of the domain assuming a capacity state satisfying the max utilization of the domain.
- int em_pd_nr_perf_states(struct em_perf_domain *pd)
Get the number of performance states of a perf. domain
Parameters
struct em_perf_domain *pd
    performance domain for which this must be done
Return
the number of performance states in the performance domain table
- struct em_perf_state *em_perf_state_from_pd(struct em_perf_domain *pd)
Get the performance states table of perf. domain
Parameters
struct em_perf_domain *pd
    performance domain for which this must be done
Description
To use this function, the rcu_read_lock() must be held. After the usage of the performance states table is finished, rcu_read_unlock() should be called.
Return
the pointer to performance states table of the performance domain
- int em_dev_update_perf_domain(struct device *dev, struct em_perf_table *new_table)
Update runtime EM table for a device
Parameters
struct device *dev
    Device for which the EM is to be updated
struct em_perf_table *new_table
    The new EM table that is going to be used from now
Description
Update EM runtime modifiable table for the dev using the provided table.
This function uses a mutex to serialize writers, so it must not be called from a non-sleeping context.
Return 0 on success or an error code on failure.
- struct em_perf_domain *em_pd_get(struct device *dev)
Return the performance domain for a device
Parameters
struct device *dev
    Device to find the performance domain for
Description
Returns the performance domain to which dev belongs, or NULL if it doesn’t exist.
- struct em_perf_domain *em_cpu_get(int cpu)
Return the performance domain for a CPU
Parameters
int cpu
    CPU to find the performance domain for
Description
Returns the performance domain to which cpu belongs, or NULL if it doesn’t exist.
- int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states, const struct em_data_callback *cb, const cpumask_t *cpus, bool microwatts)
Register the Energy Model (EM) for a device
Parameters
struct device *dev
    Device for which the EM is to register
unsigned int nr_states
    Number of performance states to register
const struct em_data_callback *cb
    Callback functions providing the data of the Energy Model
const cpumask_t *cpus
    Pointer to cpumask_t, which in case of a CPU device is obligatory. It can be taken from e.g. ‘policy->cpus’. For other type of devices this should be set to NULL.
bool microwatts
    Flag indicating that the power values are in micro-Watts or in some other scale. It must be set properly.
Description
Create Energy Model tables for a performance domain using the callbacks defined in cb.
The microwatts flag must be set to the correct value. Some kernel sub-systems might rely on this flag and check if all devices in the EM are using the same scale.
If multiple clients register the same performance domain, all but the first registration will be ignored.
Return 0 on success
- int em_dev_register_pd_no_update(struct device *dev, unsigned int nr_states, const struct em_data_callback *cb, const cpumask_t *cpus, bool microwatts)
Register a perf domain for a device
Parameters
struct device *dev
    Device to register the PD for
unsigned int nr_states
    Number of performance states in the new PD
const struct em_data_callback *cb
    Callback functions for populating the energy model
const cpumask_t *cpus
    CPUs to include in the new PD (mandatory if dev is a CPU device)
bool microwatts
    Whether or not the power values in the EM will be in uW
Description
Like em_dev_register_perf_domain(), but does not trigger a CPU capacity update after registering the PD, even if dev is a CPU device.
- void em_dev_unregister_perf_domain(struct device *dev)
Parameters
struct device *dev
    Device for which the EM is registered
Description
Unregister the EM for the specified dev (but not a CPU device).
- int em_dev_update_chip_binning(struct device *dev)
Update Energy Model after the new voltage information is present in the OPPs.
Parameters
struct device *dev
    Device for which the Energy Model has to be updated.
Description
This function allows the EM to be updated easily with new values available in the OPP framework and DT. It can be used after the chip has been properly verified by device drivers and the voltages adjusted for the ‘chip binning’.
- int em_update_performance_limits(struct em_perf_domain *pd, unsigned long freq_min_khz, unsigned long freq_max_khz)
Update Energy Model with performance limits information.
Parameters
struct em_perf_domain *pd
    Performance Domain with EM that has to be updated.
unsigned long freq_min_khz
    New minimum allowed frequency for this device.
unsigned long freq_max_khz
    New maximum allowed frequency for this device.
Description
This function updates the EM with information about available performance levels. It takes the minimum and maximum frequency in kHz and does internal translation to performance levels. Returns 0 on success or -EINVAL when failed.
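The translation from frequency limits to performance state indexes can be sketched in userspace C as follows. This is an illustration under stated assumptions, not the kernel code: the lim_* name and its signature are invented, and the sketch assumes that both limits must exactly match a frequency already present in the table, returning -EINVAL otherwise.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper: map [freq_min_khz, freq_max_khz] to the indexes of
 * the matching entries in an ascending frequency table.
 */
int lim_freq_to_states(const unsigned long *freqs_khz, size_t nr_states,
                       unsigned long freq_min_khz, unsigned long freq_max_khz,
                       int *min_ps, int *max_ps)
{
    int lo = -1, hi = -1;
    size_t i;

    assert(freqs_khz && min_ps && max_ps);

    for (i = 0; i < nr_states; i++) {
        if (freqs_khz[i] == freq_min_khz)
            lo = (int)i;
        if (freqs_khz[i] == freq_max_khz)
            hi = (int)i;
    }

    /* Reject limits that are not in the table or are inverted. */
    if (lo < 0 || hi < 0 || lo > hi)
        return -EINVAL;

    /* Outputs are only written on success. */
    *min_ps = lo;
    *max_ps = hi;
    return 0;
}
```

For a table of 500000, 1000000, 1500000 and 2000000 kHz, limits of 1000000 and 1500000 kHz map to indexes 1 and 2, while a minimum of 1200000 kHz is rejected because no state has that frequency.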
3. Examples
3.1 Example driver with EM registration
The CPUFreq framework supports a dedicated callback for registering the EM for a given CPU(s) ‘policy’ object: cpufreq_driver::register_em(). That callback has to be implemented properly for a given driver, because the framework calls it at the right time during setup. This section provides a simple example of a CPUFreq driver registering a performance domain in the Energy Model framework using the (fake) ‘foo’ protocol. The driver implements an est_power() function to be provided to the EM framework:
-> drivers/cpufreq/foo_cpufreq.c

    /* Power is reported in micro-Watts, matching 'microwatts = true' below */
    static int est_power(struct device *dev, unsigned long *uW,
                         unsigned long *KHz)
    {
            long freq, power;

            /* Use the 'foo' protocol to ceil the frequency */
            freq = foo_get_freq_ceil(dev, *KHz);
            if (freq < 0)
                    return freq;

            /* Estimate the power cost for the dev at the relevant freq. */
            power = foo_estimate_power(dev, freq);
            if (power < 0)
                    return power;

            /* Return the values to the EM framework */
            *uW = power;
            *KHz = freq;

            return 0;
    }

    static void foo_cpufreq_register_em(struct cpufreq_policy *policy)
    {
            struct em_data_callback em_cb = EM_DATA_CB(est_power);
            struct device *cpu_dev;
            int nr_opp;

            cpu_dev = get_cpu_device(cpumask_first(policy->cpus));

            /* Find the number of OPPs for this policy */
            nr_opp = foo_get_nr_opp(policy);

            /* And register the new performance domain */
            em_dev_register_perf_domain(cpu_dev, nr_opp, &em_cb, policy->cpus,
                                        true);
    }

    static struct cpufreq_driver foo_cpufreq_driver = {
            .register_em = foo_cpufreq_register_em,
    };

3.2 Example driver with EM modification
This section provides a simple example of a thermal driver modifying the EM. The driver implements a foo_thermal_em_update() function. The driver is woken up periodically to check the temperature and modify the EM data:
-> drivers/soc/example/example_em_mod.c

    static void foo_get_new_em(struct foo_context *ctx)
    {
            struct em_perf_table __rcu *em_table;
            struct em_perf_state *table, *new_table;
            struct device *dev = ctx->dev;
            struct em_perf_domain *pd;
            unsigned long freq;
            int i, ret;

            pd = em_pd_get(dev);
            if (!pd)
                    return;

            em_table = em_table_alloc(pd);
            if (!em_table)
                    return;

            new_table = em_table->state;

            /* Read the current frequencies under the RCU read lock */
            rcu_read_lock();
            table = em_perf_state_from_pd(pd);
            for (i = 0; i < pd->nr_perf_states; i++) {
                    freq = table[i].frequency;
                    foo_get_power_perf_values(dev, freq, &new_table[i]);
            }
            rcu_read_unlock();

            /* Calculate 'cost' values for EAS */
            ret = em_dev_compute_costs(dev, new_table, pd->nr_perf_states);
            if (ret) {
                    dev_warn(dev, "EM: compute costs failed %d\n", ret);
                    em_table_free(em_table);
                    return;
            }

            ret = em_dev_update_perf_domain(dev, em_table);
            if (ret) {
                    dev_warn(dev, "EM: update failed %d\n", ret);
                    em_table_free(em_table);
                    return;
            }

            /*
             * Since it's one-time-update drop the usage counter.
             * The EM framework will later free the table when needed.
             */
            em_table_free(em_table);
    }

    /*
     * Function called periodically to check the temperature and
     * update the EM if needed
     */
    static void foo_thermal_em_update(struct foo_context *ctx)
    {
            struct device *dev = ctx->dev;

            ctx->temperature = foo_get_temp(dev, ctx);
            if (ctx->temperature < FOO_EM_UPDATE_TEMP_THRESHOLD)
                    return;

            foo_get_new_em(ctx);
    }