Net DIM - Generic Network Dynamic Interrupt Moderation
- Author:
Tal Gilboa <talgi@mellanox.com>
Assumptions
This document assumes the reader has basic knowledge of network drivers and of interrupt moderation in general.
Introduction
Dynamic Interrupt Moderation (DIM) (in networking) refers to changing the interrupt moderation configuration of a channel in order to optimize packet processing. The mechanism includes an algorithm which decides if and how to change moderation parameters for a channel, usually by performing an analysis on runtime data sampled from the system. Net DIM is such a mechanism. In each iteration of the algorithm, it analyses a given sample of the data, compares it to the previous sample and, if required, decides to change some of the interrupt moderation configuration fields. The data sample is composed of data bandwidth, the number of packets and the number of events. The time between samples is also measured. Net DIM compares the current and the previous data and returns an adjusted interrupt moderation configuration object. In some cases, the algorithm might decide not to change anything. The configuration fields are the minimum duration (microseconds) allowed between events and the maximum number of wanted packets per event. The Net DIM algorithm prioritizes increasing bandwidth over reducing the interrupt rate.
Net DIM Algorithm
Each iteration of the Net DIM algorithm follows these steps:
1. Calculates new data sample.
2. Compares it to previous sample.
3. Makes a decision - suggests interrupt moderation configuration fields.
4. Schedules a work function, which applies the suggested configuration.
The first two steps are straightforward: both the new and the previous data are supplied by the driver registered to Net DIM. The previous data is the new data supplied to the previous iteration. The comparison step checks the difference between the new and previous data and decides on the result of the last step. A step is judged “better” if bandwidth increases and “worse” if bandwidth decreases. If there is no change in bandwidth, the packet rate is compared in a similar fashion - increase == “better” and decrease == “worse”. If the packet rate has not changed either, the interrupt rate is compared. Here the algorithm tries to optimize for a lower interrupt rate, so an increase in the interrupt rate is considered “worse” and a decrease is considered “better”. Step #2 has an optimization for avoiding false results: it only considers a difference between samples as valid if it is greater than a certain percentage. Also, since Net DIM does not measure anything by itself, it assumes the data provided by the driver is valid.
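The comparison order described above can be sketched in plain userspace C. This is a simplified illustration, not the kernel implementation: the names `struct rates`, `compare_rates()` and `is_significant()` are hypothetical, and the 10% threshold is an assumed value standing in for the "certain percentage" mentioned in the text.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical names; only the order of the checks comes from the text. */
enum verdict { VERDICT_WORSE, VERDICT_SAME, VERDICT_BETTER };

struct rates {
	int bpms;	/* bytes per msec */
	int ppms;	/* packets per msec */
	int epms;	/* events (interrupts) per msec */
};

/* Assumed threshold: ignore changes of 10% or less of the previous value. */
#define SIGNIFICANT_PERCENT 10

static int is_significant(int curr, int prev)
{
	/* long long keeps the multiplications from overflowing int */
	long long diff = llabs((long long)curr - prev);

	return diff * 100 > (long long)prev * SIGNIFICANT_PERCENT;
}

static enum verdict compare_rates(const struct rates *curr,
				  const struct rates *prev)
{
	/* rates are expected to be non-negative */
	assert(prev->bpms >= 0 && prev->ppms >= 0 && prev->epms >= 0);

	/* 1. bandwidth decides first */
	if (is_significant(curr->bpms, prev->bpms))
		return curr->bpms > prev->bpms ? VERDICT_BETTER : VERDICT_WORSE;

	/* 2. then the packet rate */
	if (is_significant(curr->ppms, prev->ppms))
		return curr->ppms > prev->ppms ? VERDICT_BETTER : VERDICT_WORSE;

	/* 3. finally the interrupt rate, where lower is better */
	if (is_significant(curr->epms, prev->epms))
		return curr->epms < prev->epms ? VERDICT_BETTER : VERDICT_WORSE;

	return VERDICT_SAME;
}
```

With a previous sample of 1000 bytes, 100 packets and 50 events per msec, a new sample that only drops to 30 events per msec is judged “better”, while small fluctuations in all three rates are judged “same”.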
Step #3 decides on the suggested configuration based on the result from step #2 and the internal state of the algorithm. The states reflect the “direction” of the algorithm: is it going left (reducing moderation), right (increasing moderation) or standing still. Another optimization is that if a decision to stay still is made multiple times, the interval between iterations of the algorithm would increase in order to reduce calculation overhead. Also, after “parking” on one of the leftmost or rightmost decisions, the algorithm may decide to verify this decision by taking a step in the other direction. This is done in order to avoid getting stuck in a “deep sleep” scenario. Once a decision is made, an interrupt moderation configuration is selected from the predefined profiles.
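The notion of a direction over a table of predefined profiles can be illustrated with a small userspace model. This is a sketch under assumptions, not the kernel code: `struct tuner`, `take_step()`, `turn_around()` and the table size of 5 are all hypothetical, and the "tired" bookkeeping is left out.

```c
#include <assert.h>

#define NUM_PROFILES 5	/* assumed table size, for illustration only */

enum direction { PARKED, GOING_LEFT, GOING_RIGHT };
enum step_result { STEPPED, ON_EDGE };

struct tuner {
	enum direction dir;
	int profile_ix;	/* 0 = least moderation, NUM_PROFILES - 1 = most */
};

/* Move one profile in the current direction, unless already at the edge. */
static enum step_result take_step(struct tuner *t)
{
	assert(t->profile_ix >= 0 && t->profile_ix < NUM_PROFILES);

	switch (t->dir) {
	case GOING_RIGHT:
		if (t->profile_ix == NUM_PROFILES - 1)
			return ON_EDGE;
		t->profile_ix++;
		break;
	case GOING_LEFT:
		if (t->profile_ix == 0)
			return ON_EDGE;
		t->profile_ix--;
		break;
	case PARKED:
		/* standing still: the profile index does not move */
		break;
	}
	return STEPPED;
}

/* Go left if we were going right and vice versa; do nothing when parked. */
static void turn_around(struct tuner *t)
{
	if (t->dir == GOING_LEFT)
		t->dir = GOING_RIGHT;
	else if (t->dir == GOING_RIGHT)
		t->dir = GOING_LEFT;
}
```

A tuner at index 3 going right steps to index 4, then reports `ON_EDGE` on the next attempt; after `turn_around()` it steps back to index 3.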
The last step is to notify the registered driver that it should apply the suggested configuration. This is done by scheduling a work function, defined by the Net DIM API and provided by the registered driver.
As you can see, Net DIM itself does not actively interact with the system. It would have trouble making the correct decisions if the wrong data is supplied to it, and it would be useless if the work function did not apply the suggested configuration. This does, however, allow the registered driver some room for manoeuvre, as it may provide partial data or ignore the algorithm’s suggestion under some conditions.
Registering a Network Device to DIM
Net DIM API exposes the main function net_dim(). This function is the entry point to the Net DIM algorithm and has to be called every time the driver would like to check if it should change interrupt moderation parameters. The driver should provide two data structures: struct dim and struct dim_sample. struct dim describes the state of DIM for a specific object (RX queue, TX queue, other queues, etc.). This includes the current selected profile, previous data samples, the callback function provided by the driver and more. struct dim_sample describes a data sample, which will be compared to the data sample stored in struct dim in order to decide on the algorithm’s next step. The sample should include bytes, packets and interrupts, measured by the driver.
In order to use Net DIM from a networking driver, the driver needs to call the main net_dim() function. The recommended method is to call net_dim() on each interrupt. Since Net DIM has a built-in moderation and it might decide to skip iterations under certain conditions, there is no need to moderate the net_dim() calls as well. As mentioned above, the driver needs to provide an object of type struct dim to the net_dim() function call. It is advised for each entity using Net DIM to hold a struct dim as part of its data structure and use it as the main Net DIM API object. The struct dim_sample should hold the latest bytes, packets and interrupts count. No need to perform any calculations, just include the raw data.
The net_dim() call itself does not return anything. Instead Net DIM relies on the driver to provide a callback function, which is called when the algorithm decides to make a change in the interrupt moderation parameters. This callback will be scheduled and run in a separate thread in order not to add overhead to the data flow. After the work is done, the Net DIM algorithm needs to be set to the proper state in order to move to the next iteration.
Example
The following code demonstrates how to register a driver to Net DIM. The actual usage is not complete but it should make the outline of the usage clear.

#include <linux/dim.h>

/* Callback for net DIM to schedule on a decision to change moderation */
void my_driver_do_dim_work(struct work_struct *work)
{
	/* Get struct dim from struct work_struct */
	struct dim *dim = container_of(work, struct dim, work);
	/* Do interrupt moderation related stuff */
	...
	/* Signal net DIM work is done and it should move to next iteration */
	dim->state = DIM_START_MEASURE;
}

/* My driver's interrupt handler */
int my_driver_handle_interrupt(struct my_driver_entity *my_entity, ...)
{
	...
	/* A struct to hold current measured data */
	struct dim_sample dim_sample;
	...
	/* Initiate data sample struct with current data */
	dim_update_sample(my_entity->events,
			  my_entity->packets,
			  my_entity->bytes,
			  &dim_sample);
	/* Call net DIM */
	net_dim(&my_entity->dim, &dim_sample);
	...
}

/* My entity's initialization function (my_entity was already allocated) */
int my_driver_init_my_entity(struct my_driver_entity *my_entity, ...)
{
	...
	/* Initiate struct work_struct with my driver's callback function */
	INIT_WORK(&my_entity->dim.work, my_driver_do_dim_work);
	...
}
Tuning DIM
Net DIM serves a range of network devices and generally improves their performance. However, some of DIM’s preset profiles do not match the specifications of every network device, and this profile mismatch has been identified as a cause of suboptimal performance on DIM-enabled network devices.
To address this issue, Net DIM introduces a per-device control to modify and access a device’s rx-profile and tx-profile parameters. Assume that the target network device is named ethx, and that ethx only declares support for RX profile setting and supports modification of the usec field and the pkts field (see the data structure struct dim_cq_moder).
Starting from an RX DIM profile in which all values are 64, you can modify it with ethtool:
$ ethtool -C ethx rx-profile 1,1,n_2,2,n_3,n,n_n,4,n_n,n,n
n means do not modify this field, and _ separates structure elements of the profile array.
Query the current profiles using:
$ ethtool -c ethx
...
rx-profile:
{.usec = 1, .pkts = 1, .comps = n/a,},
{.usec = 2, .pkts = 2, .comps = n/a,},
{.usec = 3, .pkts = 64, .comps = n/a,},
{.usec = 64, .pkts = 4, .comps = n/a,},
{.usec = 64, .pkts = 64, .comps = n/a,}
tx-profile: n/a
If the network device does not support specific fields of DIM profiles, n/a is displayed for them. Attempting to modify an n/a field reports an error.
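To make the profile string format concrete, the following userspace sketch applies such a string to a five-element table. It is not ethtool's or the kernel's parser: `struct profile`, `fill_profiles()` and `apply_profile_string()` are hypothetical names, the table size of 5 is taken from the example output above, and the sketch does not model unsupported (n/a) fields.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

#define NUM_PROFILES 5	/* number of elements in the example above */
#define NUM_FIELDS 3	/* usec, pkts, comps */

struct profile {
	unsigned int field[NUM_FIELDS];	/* [0] = usec, [1] = pkts, [2] = comps */
};

/* Set every field of every profile to the same starting value. */
static void fill_profiles(struct profile *table, unsigned int value)
{
	for (int p = 0; p < NUM_PROFILES; p++)
		for (int f = 0; f < NUM_FIELDS; f++)
			table[p].field[f] = value;
}

/*
 * Apply a profile string to table[]. Elements are separated by '_',
 * fields by ',', and "n" leaves a field untouched.
 * Returns 0 on success, -1 on malformed input.
 */
static int apply_profile_string(const char *s, struct profile *table)
{
	assert(s && table);

	for (int p = 0; p < NUM_PROFILES; p++) {
		for (int f = 0; f < NUM_FIELDS; f++) {
			char want;

			if (*s == 'n') {
				s++;	/* keep the current value */
			} else if (isdigit((unsigned char)*s)) {
				char *end;

				table[p].field[f] = (unsigned int)strtoul(s, &end, 10);
				s = end;
			} else {
				return -1;	/* neither "n" nor a number */
			}

			/* ',' between fields, '_' between elements, NUL at the end */
			if (f < NUM_FIELDS - 1)
				want = ',';
			else if (p < NUM_PROFILES - 1)
				want = '_';
			else
				want = '\0';
			if (*s != want)
				return -1;
			if (want != '\0')
				s++;
		}
	}
	return 0;
}
```

Filling the table with 64 and applying `1,1,n_2,2,n_3,n,n_n,4,n_n,n,n` leaves the third element with usec = 3 and pkts = 64, and the fourth with usec = 64 and pkts = 4, matching the usec and pkts columns of the query output shown above.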
Dynamic Interrupt Moderation (DIM) library API
- struct dim_cq_moder
Structure for CQ moderation values. Used for communications between DIM and its consumer.
Definition:
struct dim_cq_moder {
	u16 usec;
	u16 pkts;
	u16 comps;
	u8 cq_period_mode;
	struct rcu_head rcu;
};
Members
usec: CQ timer suggestion (by DIM)
pkts: CQ packet counter suggestion (by DIM)
comps: Completion counter
cq_period_mode: CQ period count mode (from CQE/EQE)
rcu: for asynchronous kfree_rcu
- struct dim_irq_moder
Structure for irq moderation information. Used to collect irq moderation related information.
Definition:
struct dim_irq_moder {
	u8 profile_flags;
	u8 coal_flags;
	u8 dim_rx_mode;
	u8 dim_tx_mode;
	struct dim_cq_moder *rx_profile;
	struct dim_cq_moder *tx_profile;
	void (*rx_dim_work)(struct work_struct *work);
	void (*tx_dim_work)(struct work_struct *work);
};
Members
profile_flags: DIM_PROFILE_*
coal_flags: DIM_COALESCE_* for Rx and Tx
dim_rx_mode: Rx DIM period count mode: CQE or EQE
dim_tx_mode: Tx DIM period count mode: CQE or EQE
rx_profile: DIM profile list for Rx
tx_profile: DIM profile list for Tx
rx_dim_work: Rx DIM worker scheduled by net_dim()
tx_dim_work: Tx DIM worker scheduled by net_dim()
- struct dim_sample
Structure for DIM sample data. Used for communications between DIM and its consumer.
Definition:
struct dim_sample {
	ktime_t time;
	u32 pkt_ctr;
	u32 byte_ctr;
	u16 event_ctr;
	u32 comp_ctr;
};
Members
time: Sample timestamp
pkt_ctr: Number of packets
byte_ctr: Number of bytes
event_ctr: Number of events
comp_ctr: Current completion counter
- struct dim_stats
Structure for DIM stats. Used for holding current measured rates.
Definition:
struct dim_stats {
	int ppms;
	int bpms;
	int epms;
	int cpms;
	int cpe_ratio;
};
Members
ppms: Packets per msec
bpms: Bytes per msec
epms: Events per msec
cpms: Completions per msec
cpe_ratio: Ratio of completions to events
- struct dim
Main structure for dynamic interrupt moderation (DIM). Used for holding all information about a specific DIM instance.
Definition:
struct dim {
	u8 state;
	struct dim_stats prev_stats;
	struct dim_sample start_sample;
	struct dim_sample measuring_sample;
	struct work_struct work;
	void *priv;
	u8 profile_ix;
	u8 mode;
	u8 tune_state;
	u8 steps_right;
	u8 steps_left;
	u8 tired;
};
Members
state: Algorithm state (see below)
prev_stats: Measured rates from previous iteration (for comparison)
start_sample: Sampled data at start of current iteration
measuring_sample: A dim_sample that is used to update the current events
work: Work to perform on action required
priv: A pointer to the struct that points to dim
profile_ix: Current moderation profile
mode: CQ period count mode
tune_state: Algorithm tuning state (see below)
steps_right: Number of steps taken towards higher moderation
steps_left: Number of steps taken towards lower moderation
tired: Parking depth counter
- enum dim_cq_period_mode
Modes for CQ period count
Constants
DIM_CQ_PERIOD_MODE_START_FROM_EQE: Start counting from EQE
DIM_CQ_PERIOD_MODE_START_FROM_CQE: Start counting from CQE (implies timer reset)
DIM_CQ_PERIOD_NUM_MODES: Number of modes
- enum dim_state
DIM algorithm states
Constants
DIM_START_MEASURE: This is the first iteration (also after applying a new profile)
DIM_MEASURE_IN_PROGRESS: Algorithm is already in progress - check if need to perform an action
DIM_APPLY_NEW_PROFILE: DIM consumer is currently applying a profile - no need to measure
Description
These will determine if the algorithm is in a valid state to start an iteration.
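How the three states hand over to each other can be modelled in a few lines of userspace C. This is a deliberately simplified sketch, not the kernel's net_dim() logic: `struct model`, `iterate()` and `work_done()` are hypothetical, the `wants_change` flag stands in for the whole measurement and decision process, and a counter replaces the real work scheduling.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the three DIM states described above. */
enum state { START_MEASURE, MEASURE_IN_PROGRESS, APPLY_NEW_PROFILE };

struct model {
	enum state state;
	int work_scheduled;	/* how many times the driver callback was queued */
};

/* One iteration; wants_change replaces the real comparison and decision. */
static void iterate(struct model *m, bool wants_change)
{
	switch (m->state) {
	case START_MEASURE:
		/* first sample of a round: record it and start measuring */
		m->state = MEASURE_IN_PROGRESS;
		break;
	case MEASURE_IN_PROGRESS:
		if (wants_change) {
			/* hand over to the driver's work function */
			m->state = APPLY_NEW_PROFILE;
			m->work_scheduled++;
		}
		break;
	case APPLY_NEW_PROFILE:
		/* the driver is still applying a profile: nothing to measure */
		break;
	}
}

/* What the driver's work function does once the profile is applied. */
static void work_done(struct model *m)
{
	assert(m->state == APPLY_NEW_PROFILE);
	m->state = START_MEASURE;
}
```

The model shows why the work function in the example sets the state back to DIM_START_MEASURE: until it does, further iterations are skipped and no new work is queued.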
- enum dim_tune_state
DIM algorithm tune states
Constants
DIM_PARKING_ON_TOP: Algorithm found a local top point - exit on significant difference
DIM_PARKING_TIRED: Algorithm found a deep top point - don’t exit if tired > 0
DIM_GOING_RIGHT: Algorithm is currently trying higher moderation levels
DIM_GOING_LEFT: Algorithm is currently trying lower moderation levels
Description
These will determine which action the algorithm should perform.
- enum dim_stats_state
DIM algorithm statistics states
Constants
DIM_STATS_WORSE: Current iteration shows worse performance than before
DIM_STATS_SAME: Current iteration shows the same performance as before
DIM_STATS_BETTER: Current iteration shows better performance than before
Description
These will determine the verdict of current iteration.
- enum dim_step_result
DIM algorithm step results
Constants
DIM_STEPPED: Performed a regular step
DIM_TOO_TIRED: Same kind of step was done multiple times - should go to tired parking
DIM_ON_EDGE: Stepped to the most left/right profile
Description
These describe the result of a step.
- int net_dim_init_irq_moder(struct net_device *dev, u8 profile_flags, u8 coal_flags, u8 rx_mode, u8 tx_mode, void (*rx_dim_work)(struct work_struct *work), void (*tx_dim_work)(struct work_struct *work))
collect information to initialize irq moderation
Parameters
struct net_device *dev: target network device
u8 profile_flags: Rx or Tx profile modification capability
u8 coal_flags: irq moderation params flags
u8 rx_mode: CQ period mode for Rx
u8 tx_mode: CQ period mode for Tx
void (*rx_dim_work)(struct work_struct *work): Rx worker called after dim decision
void (*tx_dim_work)(struct work_struct *work): Tx worker called after dim decision
Return
0 on success or a negative error code.
- void net_dim_free_irq_moder(struct net_device *dev)
free fields for irq moderation
Parameters
struct net_device *dev: target network device
- void net_dim_setting(struct net_device *dev, struct dim *dim, bool is_tx)
initialize DIM’s cq mode and schedule worker
Parameters
struct net_device *dev: target network device
struct dim *dim: DIM context
bool is_tx: true indicates the tx direction, false indicates the rx direction
Parameters
struct dim *dim: DIM context
- struct dim_cq_moder net_dim_get_rx_irq_moder(struct net_device *dev, struct dim *dim)
get DIM rx results based on profile_ix
Parameters
struct net_device *dev: target network device
struct dim *dim: DIM context
Return
DIM irq moderation
- struct dim_cq_moder net_dim_get_tx_irq_moder(struct net_device *dev, struct dim *dim)
get DIM tx results based on profile_ix
Parameters
struct net_device *dev: target network device
struct dim *dim: DIM context
Return
DIM irq moderation
- void net_dim_set_rx_mode(struct net_device *dev, u8 rx_mode)
set DIM rx cq mode
Parameters
struct net_device *dev: target network device
u8 rx_mode: target rx cq mode
- void net_dim_set_tx_mode(struct net_device *dev, u8 tx_mode)
set DIM tx cq mode
Parameters
struct net_device *dev: target network device
u8 tx_mode: target tx cq mode
Parameters
struct dim *dim: DIM context
Description
Check if current profile is a good place to park at. This will result in reducing the DIM checks frequency, as we assume we probably shouldn’t change profiles unless the traffic pattern has changed.
Parameters
struct dim *dim: DIM context
Description
Go left if we were going right and vice-versa. Do nothing if currently parking.
Parameters
struct dim *dim: DIM context
Description
Enter parking state. Clear all movement history.
Parameters
struct dim *dim: DIM context
Description
Enter parking state. Clear all movement history and cause DIM checks frequency to reduce.
- bool dim_calc_stats(const struct dim_sample *start, const struct dim_sample *end, struct dim_stats *curr_stats)
calculate the difference between two samples
Parameters
const struct dim_sample *start: start sample
const struct dim_sample *end: end sample
struct dim_stats *curr_stats: delta between samples
Description
Calculate the delta between two samples (in data rates). Takes into consideration counter wrap-around. Returned boolean indicates whether curr_stats are reliable.
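The idea of turning two raw samples into per-msec rates while tolerating counter wrap-around can be sketched in userspace C. This is an illustration only, not the kernel's dim_calc_stats(): `struct sample`, `struct rates`, `per_msec()` and `calc_rates()` are hypothetical names, timestamps are assumed to be plain microsecond counts, and rounding up is a choice made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins for a sample and the derived rates. */
struct sample {
	uint64_t time_us;	/* timestamp in microseconds (assumed unit) */
	uint32_t pkt_ctr;
	uint32_t byte_ctr;
	uint16_t event_ctr;
};

struct rates {
	uint64_t ppms;	/* packets per msec */
	uint64_t bpms;	/* bytes per msec */
	uint64_t epms;	/* events per msec */
};

/* count * 1000 / delta_us, rounded up, computed in 64 bits. */
static uint64_t per_msec(uint32_t count, uint64_t delta_us)
{
	return ((uint64_t)count * 1000 + delta_us - 1) / delta_us;
}

/* Returns false when no time has elapsed, so no rate can be derived. */
static bool calc_rates(const struct sample *start, const struct sample *end,
		       struct rates *out)
{
	uint64_t delta_us;

	assert(start && end && out);
	if (end->time_us <= start->time_us)
		return false;
	delta_us = end->time_us - start->time_us;

	/*
	 * Unsigned subtraction is modular, so a counter that wrapped once
	 * between the two samples still yields the correct gap.
	 */
	out->ppms = per_msec(end->pkt_ctr - start->pkt_ctr, delta_us);
	out->bpms = per_msec(end->byte_ctr - start->byte_ctr, delta_us);
	out->epms = per_msec((uint16_t)(end->event_ctr - start->event_ctr),
			     delta_us);
	return true;
}
```

For example, a packet counter that goes from 4294967290 to 10 across a 2000 microsecond interval has advanced by 16 packets, which gives 8 packets per msec.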
- void dim_update_sample(u16 event_ctr, u64 packets, u64 bytes, struct dim_sample *s)
set a sample’s fields with given values
Parameters
u16 event_ctr: number of events to set
u64 packets: number of packets to set
u64 bytes: number of bytes to set
struct dim_sample *s: DIM sample
- void dim_update_sample_with_comps(u16 event_ctr, u64 packets, u64 bytes, u64 comps, struct dim_sample *s)
set a sample’s fields with given values including the completion parameter
Parameters
u16 event_ctr: number of events to set
u64 packets: number of packets to set
u64 bytes: number of bytes to set
u64 comps: number of completions to set
struct dim_sample *s: DIM sample
- struct dim_cq_moder net_dim_get_rx_moderation(u8 cq_period_mode, int ix)
provide a CQ moderation object for the given RX profile
Parameters
u8 cq_period_mode: CQ period mode
int ix: Profile index
- struct dim_cq_moder net_dim_get_def_rx_moderation(u8 cq_period_mode)
provide the default RX moderation
Parameters
u8 cq_period_mode: CQ period mode
- struct dim_cq_moder net_dim_get_tx_moderation(u8 cq_period_mode, int ix)
provide a CQ moderation object for the given TX profile
Parameters
u8 cq_period_mode: CQ period mode
int ix: Profile index
- struct dim_cq_moder net_dim_get_def_tx_moderation(u8 cq_period_mode)
provide the default TX moderation
Parameters
u8 cq_period_mode: CQ period mode
- void net_dim(struct dim *dim, const struct dim_sample *end_sample)
main DIM algorithm entry point
Parameters
struct dim *dim: DIM instance information
const struct dim_sample *end_sample: Current data measurement
Description
Called by the consumer. This is the main logic of the algorithm, where data is processed in order to decide on next required action.
Parameters
struct dim *dim: The moderation struct.
u64 completions: The number of completions collected in this round.
Description
Each call to rdma_dim takes the latest amount of completions that have been collected and counts them as a new event. Once enough events have been collected the algorithm decides a new moderation level.