I915 GuC Submission/DRM Scheduler Section

Upstream plan

For upstream the overall plan for landing GuC submission and integrating the i915 with the DRM scheduler is:

  • Merge basic GuC submission
    • Basic submission support for all gen11+ platforms

    • Not enabled by default on any current platforms but can be enabled via modparam enable_guc

    • Lots of rework will need to be done to integrate with the DRM scheduler, so no need to nitpick everything in the code; it just should be functional, have no major coding style / layering errors, and not regress execlists

    • Update IGTs / selftests as needed to work with GuC submission

    • Enable CI on supported platforms for a baseline

    • Rework / get CI healthy for GuC submission in place as needed

  • Merge new parallel submission uAPI
    • Bonding uAPI completely incompatible with GuC submission, plus it has severe design issues in general, which is why we want to retire it no matter what

    • New uAPI adds I915_CONTEXT_ENGINES_EXT_PARALLEL context setup step which configures a slot with N contexts

    • After I915_CONTEXT_ENGINES_EXT_PARALLEL a user can submit N batches to a slot in a single execbuf IOCTL and the batches run on the GPU in parallel

    • Initially only for GuC submission but execlists can be supported if needed

  • Convert the i915 to use the DRM scheduler
    • GuC submission backend fully integrated with DRM scheduler
      • All request queues removed from backend (e.g. all backpressure handled in DRM scheduler)

      • Resets / cancels hook in DRM scheduler

      • Watchdog hooks into DRM scheduler

      • Lots of complexity of the GuC backend can be pulled out once integrated with DRM scheduler (e.g. state machine gets simpler, locking gets simpler, etc...)

    • Execlists backend will do the minimum required to hook into the DRM scheduler
      • Legacy interface

      • Features like timeslicing / preemption / virtual engines would be difficult to integrate with the DRM scheduler and these features are not required for GuC submission as the GuC does these things for us

      • ROI low on fully integrating into DRM scheduler

      • Fully integrating would add lots of complexity to DRM scheduler

    • Port i915 priority inheritance / boosting feature in DRM scheduler
      • Used for i915 page flip, may be useful to other DRM drivers as well

      • Will be an optional feature in the DRM scheduler

    • Remove in-order completion assumptions from DRM scheduler
      • Even when using the DRM scheduler the backends will handle preemption, timeslicing, etc... so it is possible for jobs to finish out of order

    • Pull out i915 priority levels and use DRM priority levels

    • Optimize DRM scheduler as needed

TODOs for GuC submission upstream

  • Need an update to GuC firmware / i915 to enable error state capture

  • Open source tool to decode GuC logs

  • Public GuC spec

New uAPI for basic GuC submission

No major changes are required to the uAPI for basic GuC submission. The only change is a new scheduler attribute: I915_SCHEDULER_CAP_STATIC_PRIORITY_MAP. This attribute indicates the 2k i915 user priority levels are statically mapped into 3 levels as follows:

  • -1k to -1 Low priority

  • 0 Medium priority

  • 1 to 1k High priority

This is needed because the GuC only has 4 priority bands. The highest priority band is reserved for the kernel. This aligns with the DRM scheduler priority levels too.
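The static mapping above can be sketched as a small C helper. This is not part of the i915 or GuC code, just a hypothetical function illustrating how the 2k user priority levels collapse into the 3 bands described:

```c
#include <assert.h>

/* Hypothetical helper (not i915 uAPI): illustrates the static mapping of
 * the i915 user priority range [-1023, 1023] into 3 priority levels.
 */
enum prio_band { PRIO_LOW, PRIO_MEDIUM, PRIO_HIGH };

static enum prio_band map_user_priority(int prio)
{
	if (prio < 0)
		return PRIO_LOW;	/* -1k to -1 */
	if (prio == 0)
		return PRIO_MEDIUM;	/* 0 */
	return PRIO_HIGH;		/* 1 to 1k */
}
```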

Spec references:

New parallel submission uAPI

The existing bonding uAPI is completely broken with GuC submission because whether a submission is a single context submit or parallel submit isn’t known until execbuf time, activated via the I915_SUBMIT_FENCE. To submit multiple contexts in parallel with the GuC the context must be explicitly registered with N contexts and all N contexts must be submitted in a single command to the GuC. The GuC interfaces do not support dynamically changing between N contexts as the bonding uAPI does. Hence the need for a new parallel submission interface. Also the legacy bonding uAPI is quite confusing and not intuitive at all. Furthermore I915_SUBMIT_FENCE is by design a future fence, so not really something we should continue to support.

The new parallel submission uAPI consists of 3 parts:

  • Export engines logical mapping

  • A ‘set_parallel’ extension to configure contexts for parallel submission

  • Extend execbuf2 IOCTL to support submitting N BBs in a single IOCTL

Export engines logical mapping

Certain use cases require BBs to be placed on engine instances in logical order (e.g. split-frame on gen11+). The logical mapping of engine instances can change based on fusing. Rather than making UMDs be aware of fusing, simply expose the logical mapping with the existing query engine info IOCTL. Also the GuC submission interface currently only supports submitting multiple contexts to engines in logical order which is a new requirement compared to execlists. Lastly, all current platforms have at most 2 engine instances and the logical order is the same as uAPI order. This will change on platforms with more than 2 engine instances.

A single bit will be added to drm_i915_engine_info.flags indicating that the logical instance has been returned and a new field, drm_i915_engine_info.logical_instance, returns the logical instance.
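A minimal sketch of how a UMD might consume the new field. The struct and flag name here are simplified stand-ins, not the real uAPI definitions; the fallback to uAPI order reflects the statement above that current platforms report the same logical and uAPI order:

```c
#include <stdint.h>

/* Simplified stand-in for drm_i915_engine_info; the flag bit name below is
 * an assumption for illustration only.
 */
#define ENGINE_INFO_HAS_LOGICAL_INSTANCE (1u << 0)

struct engine_info {
	uint16_t instance;          /* uAPI (physical) instance */
	uint16_t logical_instance;  /* valid only when the flag is set */
	uint64_t flags;
};

/* Return the logical instance when the kernel reports one, otherwise fall
 * back to the uAPI instance (logical order == uAPI order on current
 * platforms per the text above).
 */
static uint16_t effective_logical_instance(const struct engine_info *info)
{
	if (info->flags & ENGINE_INFO_HAS_LOGICAL_INSTANCE)
		return info->logical_instance;
	return info->instance;
}
```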

A ‘set_parallel’ extension to configure contexts for parallel submission

The ‘set_parallel’ extension configures a slot for parallel submission of N BBs. It is a setup step that must be called before using any of the contexts. See I915_CONTEXT_ENGINES_EXT_LOAD_BALANCE or I915_CONTEXT_ENGINES_EXT_BOND for similar existing examples. Once a slot is configured for parallel submission the execbuf2 IOCTL can be called submitting N BBs in a single IOCTL. Initially only GuC submission is supported. Execlists support can be added later if needed.

Add I915_CONTEXT_ENGINES_EXT_PARALLEL_SUBMIT and drm_i915_context_engines_parallel_submit to the uAPI to implement this extension.

struct i915_context_engines_parallel_submit

Configure engine for parallel submission.

Definition:

struct i915_context_engines_parallel_submit {
        struct i915_user_extension base;
        __u16 engine_index;
        __u16 width;
        __u16 num_siblings;
        __u16 mbz16;
        __u64 flags;
        __u64 mbz64[3];
        struct i915_engine_class_instance engines[];
};

Members

base

base user extension.

engine_index

slot for parallel engine

width

number of contexts per parallel engine or in other words the number of batches in each submission

num_siblings

number of siblings per context or in other words the number of possible placements for each submission

mbz16

reserved for future use; must be zero

flags

all undefined flags must be zero; no flags are currently defined

mbz64

reserved for future use; must be zero

engines

2-d array of engine instances to configure the parallel engine

length = width (i) * num_siblings (j)
index = j + i * num_siblings

Description

Setup a slot in the context engine map to allow multiple BBs to be submitted in a single execbuf IOCTL. Those BBs will then be scheduled to run on the GPU in parallel. Multiple hardware contexts are created internally in the i915 to run these BBs. Once a slot is configured for N BBs only N BBs can be submitted in each execbuf IOCTL and this is implicit behavior, e.g. the user doesn’t tell the execbuf IOCTL there are N BBs; the execbuf IOCTL knows how many BBs there are based on the slot’s configuration. The N BBs are the last N buffer objects or first N if I915_EXEC_BATCH_FIRST is set.

The default placement behavior is to create implicit bonds between each context if each context maps to more than 1 physical engine (e.g. context is a virtual engine). Also we only allow contexts of same engine class and these contexts must be in logically contiguous order. Examples of the placement behavior are described below. Lastly, the default is to not allow BBs to be preempted mid-batch. Rather insert coordinated preemption points on all hardware contexts between each set of BBs. Flags could be added in the future to change both of these default behaviors.

Returns -EINVAL if hardware context placement configuration is invalid or if the placement configuration isn’t supported on the platform / submission interface.
Returns -ENODEV if extension isn’t supported on the platform / submission interface.

Example syntax:
CS[X] = generic engine of same class, logical instance X
INVALID = I915_ENGINE_CLASS_INVALID, I915_ENGINE_CLASS_INVALID_NONE

Example 1 pseudo code:
set_engines(INVALID)
set_parallel(engine_index=0, width=2, num_siblings=1,
             engines=CS[0],CS[1])

Results in the following valid placement:
CS[0], CS[1]

Example 2 pseudo code:
set_engines(INVALID)
set_parallel(engine_index=0, width=2, num_siblings=2,
             engines=CS[0],CS[2],CS[1],CS[3])

Results in the following valid placements:
CS[0], CS[1]
CS[2], CS[3]

This can be thought of as two virtual engines, each containing two engines, thereby making a 2D array. However, there are bonds tying the entries together and placing restrictions on how they can be scheduled. Specifically, the scheduler can choose only vertical columns from the 2D array. That is, CS[0] is bonded to CS[1] and CS[2] to CS[3]. So if the scheduler wants to submit to CS[0], it must also choose CS[1] and vice versa. Same for CS[2] requires also using CS[3].

VE[0] = CS[0], CS[2]
VE[1] = CS[1], CS[3]

Example 3 pseudo code:
set_engines(INVALID)
set_parallel(engine_index=0, width=2, num_siblings=2,
             engines=CS[0],CS[1],CS[1],CS[3])

Results in the following valid and invalid placements:
CS[0], CS[1]
CS[1], CS[3] - Not logically contiguous, return -EINVAL
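The engines[] indexing from the struct documentation (index = j + i * num_siblings) can be checked against Example 2 with a trivial helper. This is illustrative code, not part of the i915; the engines array below encodes the CS[0],CS[2],CS[1],CS[3] layout from Example 2 by logical instance number:

```c
/* Index into the flattened engines[] array per the struct docs:
 * length = width (i) * num_siblings (j), index = j + i * num_siblings.
 * Row i selects the batch slot, column j selects the sibling placement.
 */
static int engines_index(int i, int j, int num_siblings)
{
	return j + i * num_siblings;
}
```

Reading a fixed column j across all rows i yields one valid placement, which is exactly the "vertical columns only" bonding restriction described above.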

Extend execbuf2 IOCTL to support submitting N BBs in a single IOCTL

Contexts that have been configured with the ‘set_parallel’ extension can only submit N BBs in a single execbuf2 IOCTL. The BBs are either the last N objects in the drm_i915_gem_exec_object2 list or the first N if I915_EXEC_BATCH_FIRST is set. The number of BBs is implicit based on the slot submitted and how it has been configured by ‘set_parallel’ or other extensions. No uAPI changes are required to the execbuf2 IOCTL.
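The object-list positioning rule above can be sketched as a small helper. This is a hypothetical illustration, not i915 code; `width` is the N configured by ‘set_parallel’ and `buffer_count` is the length of the drm_i915_gem_exec_object2 list:

```c
#include <stdbool.h>

/* Hypothetical helper: index of the n-th batch buffer (0 <= n < width)
 * in an execbuf2 object list of buffer_count entries. The BBs are the
 * first `width` objects when I915_EXEC_BATCH_FIRST is set, otherwise
 * the last `width` objects.
 */
static unsigned int batch_index(unsigned int n, unsigned int width,
				unsigned int buffer_count, bool batch_first)
{
	return batch_first ? n : buffer_count - width + n;
}
```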