Movatterモバイル変換
[0]ホーム
[PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by removing dynamic callbacks
Kim, JonathanJonathan.Kim at amd.com
Fri Mar 14 14:43:48 UTC 2025
[Public]> -----Original Message-----> From: Alex Deucher <alexdeucher at gmail.com>> Sent: Thursday, March 13, 2025 10:38 AM> To: Zhang, Jesse(Jie) <Jesse.Zhang at amd.com>> Cc: Koenig, Christian <Christian.Koenig at amd.com>; amd->gfx at lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher at amd.com>;> Kim, Jonathan <Jonathan.Kim at amd.com>; Zhu, Jiadong> <Jiadong.Zhu at amd.com>> Subject: Re: [PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by> removing dynamic callbacks>> I think as long as the locking is correct, the src shouldn't matter.> You just need to stop the kernel queues (and save state) and evict the> user queues (since HWS is responsible for saving the MQDs of the> non-guilty user queues). If KFD detected the hang (e.g., queue> eviction fails when called for memory pressure, etc.), we just need to> make sure that it's ok for the sdma reset routine to call evict queues> again even if it was already called (presumably it should exit early> since the queues are already evicted). If KGD initiates the reset, it> will call into KFD to evict queues. We just need to make sure the> evict queues function we call just evicts the queues and doesn't also> try and reset.If we're removing the src parameter, we need to be careful we don't end up in a double lock scenario in the KFD.i.e. kgd inits reset -> kfd detects hang on kgd reset trigger and calls back to kgd -> amdgpu_amdkfd_suspend gets called again but is blocked on previous suspend call from original kgd reset (which is holding a bunch of KFD locks) while KFD is trying to clean up immediately.Safest way to remove the parameter seems like to move the KFD suspend/restore outside of the common call and have KGD call suspend/resume when it's calling the common call itself.Jon>> Alex>> On Wed, Mar 12, 2025 at 5:24 AM Zhang, Jesse(Jie) <Jesse.Zhang at amd.com>> wrote:> >> > [AMD Official Use Only - AMD Internal Distribution Only]> >> >> >> >> >> >> > From: Koenig, Christian <Christian.Koenig at amd.com>> > Sent: Wednesday, March 12, 2025 4:39 PM> > To: Zhang, Jesse(Jie) <Jesse.Zhang at amd.com>;amd-gfx at lists.freedesktop.org> > Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Kim, Jonathan> <Jonathan.Kim at amd.com>; Zhu, Jiadong <Jiadong.Zhu at amd.com>> > Subject: Re: [PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by> removing dynamic callbacks> >> >> >> > Am 12.03.25 um 09:15 schrieb Zhang, Jesse(Jie):> >> > [SNIP9> >> > -> >> > + gfx_ring->funcs->stop_queue(adev, instance_id);> >> >> >> > Yeah that starts to look good. Question here is who is calling> amdgpu_sdma_reset_engine()?> >> >> >> > If this call comes from engine specific code we might not need the> start/stop_queue callbacks all together.> >> >> >> > Kfd and sdma v4/v5/v5_2 will call amdgpu_sdma_reset_engine, and> start/stop_queue callbacks are only implemented in sdmav4/sdmav5/sdma5_2.> >> >> > Why would the KFD call this as well? Because it detects an issue with a SDMA> user queue If yes I would rather suggest that the KFD calls the reset function of> the paging queue.> >> > Since this reset function is specific to the SDMA HW generation anyway you don't> need those extra functions to abstract starting and stopping of the queue for each> HW generation.> >> > kfd can't call reset function directly, unless we add a parameter src to distinguish> kfd and kgd in reset function, like this:> >> > int (*reset)(struct amdgpu_ring *ring, unsigned int vmid, int src );> >> > As Alex said in another thread,> >> > We need to distinguish kfd and kgd in reset.> >> > If kfd triggers a reset, kgd must save healthy jobs and recover jobs after reset.> >> > If kgd triggers a reset, kgd must abandon bad jobs after reset.(and perhaps kfd> needs to save its healthy jobs for recovery).> >> >> >> > If we can add a parameter, I am ok for that solution too.> >> >> >> > Additionally:> >> > For sdma6/7, when a queue reset fails, we may need to fall back to an engine> reset for a attempt.> >> >> >> > Thanks> >> > Jesse> >> >> > Regards,> > Christian.> >> >> >> >> >> >> > Thanks> >> > Jesse> >> >> >> > Regards,> >> > Christian.> >> >> >> > /* Perform the SDMA reset for the specified instance */> >> > ret = amdgpu_dpm_reset_sdma(adev, 1 << instance_id);> >> > if (ret) {> >> > @@ -591,18 +573,7 @@ int amdgpu_sdma_reset_engine(struct amdgpu_device> *adev, uint32_t instance_id, b> >> > goto exit;> >> > }> >> >> >> > - /* Invoke all registered post_reset callbacks */> >> > - list_for_each_entry(funcs, &adev->sdma.reset_callback_list, list) {> >> > - if (funcs->post_reset) {> >> > - ret = funcs->post_reset(adev, instance_id);> >> > - if (ret) {> >> > - dev_err(adev->dev,> >> > - "afterReset callback failed for instance %u: %d\n",> >> > - instance_id, ret);> >> > - goto exit;> >> > - }> >> > - }> >> > - }> >> > + gfx_ring->funcs->start_queue(adev, instance_id);> >> >> >> > exit:> >> > /* Restart the scheduler's work queue for the GFX and page rings> >> > diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c> >> > b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c> >> > index fd34dc138081..c1f7ccff9c4e 100644> >> > --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c> >> > +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c> >> > @@ -2132,6 +2132,8 @@ static const struct amdgpu_ring_funcs> sdma_v4_4_2_ring_funcs = {> >> > .emit_reg_wait = sdma_v4_4_2_ring_emit_reg_wait,> >> > .emit_reg_write_reg_wait = amdgpu_ring_emit_reg_write_reg_wait_helper,> >> > .reset = sdma_v4_4_2_reset_queue,> >> > + .stop_queue = sdma_v4_4_2_stop_queue,> >> > + .start_queue = sdma_v4_4_2_restore_queue,> >> > .is_guilty = sdma_v4_4_2_ring_is_guilty, };> >> >> >> >> >> >
More information about the amd-gfxmailing list
[8]ページ先頭