drm/amdgpu: Propagate amd_acquire errors in rdma_get_pages #194

Open

mattaezell wants to merge 2,290 commits into ROCm:master from mattaezell:amdgpu_rdma_get_pages_amd_acquire

Contributor

mattaezell commented Jul 31, 2025

amd_acquire returns 1 on success and 0 on failure. rdma_get_pages needs to return non-zero if amd_acquire fails.

Originally found by Chuck Fossen from HPE
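The return-convention mismatch described above can be sketched in a few lines of C. This is a minimal illustration, not the driver's code: `example_acquire`, `get_pages_buggy`, and `get_pages_fixed` are hypothetical names, the "buggy" shape is only one plausible way such a mismatch arises, and the `-EINVAL` error code is chosen purely for illustration.

```c
#include <errno.h>

/* Hypothetical stand-in for amd_acquire(): returns 1 on success and
 * 0 on failure, the convention stated in the PR description. */
int example_acquire(int should_succeed)
{
	return should_succeed ? 1 : 0;
}

/* One plausible buggy shape: the caller only checks for negative values,
 * so a failure (0) falls through and is reported as success (0). */
int get_pages_buggy(int should_succeed)
{
	int ret = example_acquire(should_succeed);

	if (ret < 0)
		return ret;
	return 0;
}

/* Fixed shape: translate the 1/0 convention into 0 / non-zero. */
int get_pages_fixed(int should_succeed)
{
	if (!example_acquire(should_succeed))
		return -EINVAL;	/* illustrative error code (assumption) */
	return 0;
}
```

With the fixed shape, a caller that treats any non-zero return as an error no longer proceeds after a failed acquire.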

pldrc and others added 30 commits

February 5, 2025 16:23

drm/amdgpu: Enable devcoredump for JPEG4_0_3

d6525f7

Add register list and enable devcoredump for JPEG4_0_3
V2: (Lijo)
 - remove version specific callbacks and use simplified helper functions
V3: (Lijo)
 - move amdgpu_jpeg_reg_dump_fini() to sw_fini() and avoid the call here
Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>

drm/amdgpu: Enable devcoredump for JPEG5_0_1

3a24351

Add register list and enable devcoredump for JPEG5_0_1
V2: (Lijo)
 - remove version specific callbacks and use simplified helper functions
V3: (Lijo)
 - move amdgpu_jpeg_reg_dump_fini() to sw_fini() and avoid the call here
Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>

drm/amdgpu: Enable devcoredump for JPEG4_0_0

5fbd2f5

Add register list and enable devcoredump for JPEG4_0_0
V2: (Lijo)
 - remove version specific callbacks and use simplified helper functions
V3: (Lijo)
 - move amdgpu_jpeg_reg_dump_fini() to sw_fini() and avoid the call here
Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>

drm/amdgpu: Enable devcoredump for JPEG4_0_5

476b404

Add register list and enable devcoredump for JPEG4_0_5
V2: (Lijo)
 - remove version specific callbacks and use simplified helper functions
V3: (Lijo)
 - move amdgpu_jpeg_reg_dump_fini() to sw_fini() and avoid the call here
Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>

drm/amdgpu: Enable devcoredump for JPEG3_0_0

fbd6586

Add register list and enable devcoredump for JPEG3_0_0
V2: (Lijo)
 - remove version specific callbacks and use simplified helper functions
V3: (Lijo)
 - move amdgpu_jpeg_reg_dump_fini() to sw_fini() and avoid the call here
Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>

drm/amdgpu: Enable devcoredump for JPEG2_0_0

398a212

Add register list and enable devcoredump for JPEG2_0_0
V2: (Lijo)
 - remove version specific callbacks and use simplified helper functions
V3: (Lijo)
 - move amdgpu_jpeg_reg_dump_fini() to sw_fini() and avoid the call here
Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>

drm/amdgpu: Enable devcoredump for JPEG2_5_0

e508f93

Add register list and enable devcoredump for JPEG2_5_0
V2: (Lijo)
 - remove version specific callbacks and use simplified helper functions
V3: (Lijo)
 - move amdgpu_jpeg_reg_dump_fini() to sw_fini() and avoid the call here
Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>

drm/amdgpu: Enable devcoredump for JPEG5_0_0

31097ba

Add register list and enable devcoredump for JPEG5_0_0
V2: (Lijo)
 - remove version specific callbacks and use simplified helper functions
V3: (Lijo)
 - move amdgpu_jpeg_reg_dump_fini() to sw_fini() and avoid the call here
Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>

drm/amd/pm: Limit to 8 jpeg rings per instance

012c345

JPEG 5.0.1 supports up to 10 rings, however PMFW support for SMU v13.0.6 variants is now limited to 8 per instance. Limit to 8 temporarily to avoid out of bounds access.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/display: Respect user's CONFIG_FRAME_WARN more for dml files

4d3d72c

Currently, there are several files in drm/amd/display that aim to have a higher -Wframe-larger-than value to avoid instances of that warning with a lower value from the user's configuration. However, with the way that it is currently implemented, it does not respect the user's request via CONFIG_FRAME_WARN for a higher stack frame limit, which can cause pain when new instances of the warning appear and break the build due to CONFIG_WERROR.
Adjust the logic to switch from a hard coded -Wframe-larger-than value to only using the value as a minimum clamp and deferring to the requested value from CONFIG_FRAME_WARN if it is higher.
Suggested-by: Harry Wentland <harry.wentland@amd.com>
Reported-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Closes: https://lore.kernel.org/2025013003-audience-opposing-7f95@gregkh/
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/display: Remove unused mpc1_is_mpcc_idle

db81010

mpc1_is_mpcc_idle() was added in 2017 by commit feb4a3c ("drm/amd/display: Integrating MPC pseudocode") but never used.
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/display: Remove unused freesync functions

aa12e6c

mod_freesync_get_vmin_vmax() and mod_freesync_get_v_position() were added in 2017 by commit 72ada5f ("drm/amd/display: FreeSync Auto Sweep Support")
mod_freesync_is_valid_range() was added in 2018 by commit e80e944 ("drm/amd/display: add method to check for supported range")
mod_freesync_get_settings() was added in 2018 by commit a3e1737 ("drm/amd/display: Implement stats logging")
and mod_freesync_calc_field_rate_from_timing() was added in 2020 by commit 49c70ec ("drm/amd/display: Change input parameter for set_drr")
None of these have been used.
Remove them.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/display: Remove unused dc_stream_get_crtc_position

b848ca8

The last user of dc_stream_get_crtc_position() was mod_freesync_get_v_position() which is removed in a previous patch in this series.
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/display: Remove unused get_clock_requirements_for_state

8d55b67

get_clock_requirements_for_state() was added in 2018 by commit 8ab2180 ("drm/amd/display: Add function to fetch clock requirements") but never used.
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/display: Remove unused hubbub1_toggle_watermark_change_req

fa18162

hubbub1_toggle_watermark_change_req() last use was removed in 2017 by commit b8fce2c ("drm/amd/display: Optimize programming front end")
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/display: Remove unused get_max_support_fbc_buffersize

2033e64

get_max_support_fbc_buffersize() is unused since 2021's commit 94f0d0c ("drm/amd/display/dc/dce110/dce110_compressor: Remove unused function 'dce110_get_required_compressed_surfacesize") removed its only caller.
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/display: Remove unused link_enc_cfg_get_link_enc_used_by_stream

e2a39cc

link_enc_cfg_get_link_enc_used_by_stream() is no longer used after 2021's:
commit 6366b00 ("drm/amd/display: Maintain consistent mode of operation during encoder assignment")
which introduces and uses the _current version instead.
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Revert "drm/amd/display: Use HW lock mgr for PSR1"

b8159e7

This reverts commit 2a69ae1e1354 ("drm/amd/display: Use HW lock mgr for PSR1") because it may cause a system hang when two eDP panels are connected.
Acked-by: Wayne Lin <wayne.lin@amd.com>
Signed-off-by: Tom Chung <chiahsuan.chung@amd.com>

drm/amdkfd: Ensure consistent barrier state saved in gfx12 trap handler

c60139f

It is possible for some waves in a workgroup to finish their save sequence before the group leader has had time to capture the workgroup barrier state. When this happens, having those waves exit does impact the barrier state. As a consequence, the state captured by the group leader is invalid, and is eventually incorrectly restored.
This patch proposes to have all waves in a workgroup wait for each other at the end of their save sequence (just before calling s_endpgm_saved).
Signed-off-by: Lancelot SIX <lancelot.six@amd.com>
Reviewed-by: Jay Cornwall <jay.cornwall@amd.com>

drm/amd/display: Replace pr_info in dc_validate_boot_timing()

63332eb

Use DC_LOG_DEBUG instead of pr_info to match other uses in dc.c.
Fixes: eb8eec752038 ("drm/amd/display: Add debug messages for dc_validate_boot_timing()")
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Hung <alex.hung@amd.com>

drm/amdgpu/sdma4: drop gfxoff calls in dump ip state

2e89a30

SDMA 4.x is not part of the GFX power domain so this is not necessary.
Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Clean up atom header file inclusion

f5ae4ad

atom bios header files are not required in these files.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/pm: Add APIs for device access checks

803881e

Wrap the checks before device access in helper functions and use them for device access. The generic order of APIs now is to do input argument validation first and check if device access is allowed.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Feifei Xu <feifei.xu@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>

drm/amd/pm: Fix get_if_active usage

df1dfeb

If a device supports runtime pm, then pm_runtime_get_if_active returns 0 if a device is not active and 1 if already active. However, if a device doesn't support runtime pm, the API returns -EINVAL. A device not supporting runtime pm implies it's not affected by runtime pm and it's active. Hence no need to get() to increment usage count. Remove < 0 return value check. Also, ignore runpm state to determine active status.
If the device is already in suspend state, disallow access.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Feifei Xu <feifei.xu@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
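The three return cases described in this commit message can be sketched as a small decision helper. This is a hedged illustration only: `example_access_allowed` and `example_must_put` are hypothetical names, the integer argument merely stands in for the value returned by `pm_runtime_get_if_active()`, and `-EPERM` is an assumed error code rather than the one the driver uses.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper, not the driver's code. `get_if_active_ret` stands in
 * for the value returned by pm_runtime_get_if_active():
 *    1        runtime PM supported, device active, usage count incremented
 *    0        runtime PM supported, device not active
 *   -EINVAL   runtime PM not supported, so the device counts as active
 */
int example_access_allowed(int get_if_active_ret, bool in_suspend)
{
	if (in_suspend)
		return -EPERM;	/* illustrative error code (assumption) */
	if (get_if_active_ret == 0)
		return -EPERM;	/* runtime-suspended: do not touch hardware */
	return 0;		/* active, with or without runtime PM support */
}

/* Only the "1" case took a usage-count reference that must be dropped later. */
bool example_must_put(int get_if_active_ret)
{
	return get_if_active_ret > 0;
}
```

The point of the sketch is that a negative return is not treated as "inactive": a device without runtime PM is simply always active and holds no reference to release.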

drm/amd/pm: Remove unnecessary device state checks

5a9d19d

For amdgpu_get_pp_force_state, amdgpu_get_pp_cur_state already takes care of device state check. In other cases, values are returned from driver cached variables and are not dependent on device state.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Feifei Xu <feifei.xu@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>

drm/amdgpu: Remove remaining AMDKCL_AMDGPU_DMABUF_OPS refs

e015bf9

These were missed in the cleanup patch, so remove them to allow the kernel to compile again
Fixes: 083e622 ("drm/amdkcl: cleanup macro AMDKCL_AMDGPU_DMABUF_OPS")
Signed-off-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Bob Zhou <Bob.Zhou@amd.com>

drm/amd/include : MES v11 and v12 API header update

f5a4b46

MES requires driver set cleaner_shader_fence_mc_addr for cleaner shader support.
Signed-off-by: Shaoyun Liu <shaoyun.liu@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Change-Id: Ie7a20254683948735c6c3b9c20e6f0f842ab0720

drm/amdgpu/gfx9: manually control gfxoff for CS on RV

ba6bbf6

When mesa started using compute queues more often we started seeing additional hangs with compute queues. Disabling gfxoff seems to mitigate that. Manually control gfxoff and gfx pg with command submissions to avoid any issues related to gfxoff. KFD already does the same thing for these chips.
v2: limit to compute
v3: limit to APUs
v4: limit to Raven/PCO
v5: only update the compute ring_funcs
v6: Disable GFX PG
v7: adjust order
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Suggested-by: Błażej Szczygieł <mumei6102@gmail.com>
Suggested-by: Sergey Kovalenko <seryoga.engineering@gmail.com>
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3861
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-January/119116.html
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: bump version for RV/PCO compute fix

caebc39

Bump the driver version for RV/PCO compute stability fix so mesa can use this check to enable compute queues on RV/PCO.
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/gfx: add amdgpu_gfx_off_ctrl_immediate()

184b804

Same as amdgpu_gfx_off_ctrl(), but without the delay for gfxoff disallow.
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Suggested-by: Błażej Szczygieł <mumei6102@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Yifan Zha and others added 24 commits

May 27, 2025 09:37

drm/amdgpu: refine MES register print for devices of hive

9ffb3f8

[Why]
Register access print missed device info.
[How]
Using dev_xxx instead of DRM_xxx to indicate which device of a hive is the message for.
Signed-off-by: Yifan Zha <Yifan.Zha@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/gfx10: Refine Cleaner Shader for GFX10.1.10

244ece2

This patch updates the cleaner shader, which is responsible for initializing GPU resources such as Local Data Share (LDS), Vector General Purpose Registers (VGPRs), and Scalar General Purpose Registers (SGPRs). Changes include adjustments to register clearing and shader configuration.
- Updated GPU resource initialization addresses in the cleaner shader from `be803080` to `be803000`.
- Simplified the logic in the SGPR clearing section, ensuring all SGPRs are set to zero.
Fixes: 25961ba ("drm/amdgpu/gfx10: Add cleaner shader for GFX10.1.10")
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Manu Rastogi <manu.rastogi@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/pm: Remove host limit metrics support

e3b69b1

Firmware algorithm changed and the values in this version are not accurate thereby remove host limit metric support for smu_v13_0_6, smu_v13_0_12 & smu_v13_0_14
Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>

drm/amd/pm: Update smu metrics table for smu_v13_0_6

8313d5c

Update smu metrics table to version 0x10 for smu_v13_0_6
v2: Host metrics support removal moved to separate patch (Lijo)
Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>

drm/amdgpu: Add pldm version reporting

8d8da71

Add pldm version reporting through sysfs node
Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

drm/amd/pm: Update pmfw headers for smu_v_13_0_6

f3a51b7

Update pmfw headers for smu_v_13_0_6 to include pldm version as part of statics metrics table
Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

drm/amd/pm: Fill pldm version for SMU v13.0.6 SOCs

84efc8e

Fetch pldm version from static metrics table for SMU v13.0.6 SOCs
Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

drm/amd/pm: Enable static metrics table support

a7fa379

Enable static metrics support to fetch board voltage and pldm version for other smu_v13_0_6 program
Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

drm/amd/pm: Enable static metrics table support

6f0f111

Enable static metrics support to fetch board voltage and pldm version for smu_v13_0_14
Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

drm/amdgpu: Increase KIQ invalidate_tlbs timeout

34f1dca

KIQ invalidate_tlbs request has been seen to marginally exceed the configured 100 ms timeout on systems under load.
All other KIQ requests in the driver use a 10 second timeout. Use a similar timeout implementation on the invalidate_tlbs path.
v2: Poll once before msleep
v3: Fix return value
Signed-off-by: Jay Cornwall <jay.cornwall@amd.com>
Cc: Kent Russell <kent.russell@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
(cherry picked from commit efc206db3bcf6a5ce4dc3ecb97aba4b4548ffcc2)
Change-Id: I932ad73e0a6bc71d4a917ab9ed69f1f071f29fef
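The "poll once before msleep" pattern from the v2 note can be sketched as a generic wait loop. This is a minimal user-space illustration under stated assumptions, not the driver's implementation: `example_wait`, the callback types, and the `countdown` helpers are all hypothetical, and the sleep is injected as a callback so the sketch stays self-contained.

```c
#include <errno.h>
#include <stdbool.h>

typedef bool (*example_poll_fn)(void *ctx);
typedef void (*example_sleep_fn)(unsigned int ms);

/*
 * Hypothetical wait helper: check the completion condition once before
 * sleeping at all, then alternate sleep and poll until the timeout budget
 * is used up. Returns 0 on completion, -ETIMEDOUT otherwise.
 */
int example_wait(example_poll_fn done, void *ctx, example_sleep_fn sleep_ms,
		 unsigned int timeout_ms, unsigned int step_ms)
{
	unsigned int waited = 0;

	if (done(ctx))		/* fast path: no sleep if already complete */
		return 0;

	while (waited < timeout_ms) {
		sleep_ms(step_ms);
		waited += step_ms;
		if (done(ctx))
			return 0;
	}
	return -ETIMEDOUT;
}

/* Example poll callback: reports completion on the Nth call. */
struct countdown {
	int calls_left;
};

bool countdown_done(void *ctx)
{
	struct countdown *c = ctx;

	return --c->calls_left <= 0;
}

/* Example sleep callback that does not actually sleep. */
void fake_sleep(unsigned int ms)
{
	(void)ms;
}
```

Polling before the first sleep means a request that has already completed pays no sleep latency, while the longer budget only matters on loaded systems.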

drm/amdgpu: seq64 memory unmap uses uninterruptible lock

0e668cb

seq64 memory is unmapped and freed when the drm node is closed and the vm is freed. If a signal is received at that point, taking the vm lock fails and the seq64 va mapping leaks, and dmesg then shows the error log "still active bo inside vm".
Change to use an uninterruptible lock to fix the mapping leak and the dmesg error log.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>

drm/scheduler: signal scheduled fence when kill job

3a670d0

When an entity from application B is killed, drm_sched_entity_kill() removes all jobs belonging to that entity through drm_sched_entity_kill_jobs_work(). If application A's job depends on a scheduled fence from application B's job, and that fence is not properly signaled during the killing process, application A's dependency cannot be cleared.
This leads to application A hanging indefinitely while waiting for a dependency that will never be resolved. Fix this issue by ensuring that scheduled fences are properly signaled when an entity is killed, allowing dependent applications to continue execution.
Signed-off-by: Lin.Cao <lincao12@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Philipp Stanner <phasta@kernel.org>

drm/amdgpu: Add kicker device detection

2aa2944

1. add kicker device list
2. add kicker device checking helper function
Signed-off-by: Frank Min <Frank.Min@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

drm/amdgpu: add kicker fws loading for gfx11/smu13/psp13

8119782

1. Add kicker firmwares loading for gfx11/smu13/psp13
2. Register additional MODULE_FIRMWARE entries for kicker fws
   - gc_11_0_0_rlc_kicker.bin
   - gc_11_0_0_imu_kicker.bin
   - psp_13_0_0_sos_kicker.bin
   - psp_13_0_0_ta_kicker.bin
   - smu_13_0_0_kicker.bin
Signed-off-by: Frank Min <Frank.Min@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

drm/amdgpu: Add basic validation for RAS header

cd07dd2

If RAS header read from EEPROM is corrupted, it could result in trying to allocate huge memory for reading the records. Add some validation to header fields.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
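The idea of validating header fields before sizing an allocation from them can be sketched as follows. Everything here is an assumption made for illustration: the struct layout, the signature value, and the size limits are hypothetical and do not reflect the actual RAS EEPROM table format.

```c
#include <stdbool.h>
#include <stdint.h>

/* All constants below are hypothetical, chosen only for the example. */
#define EXAMPLE_TABLE_MAGIC	0x45584d50u
#define EXAMPLE_HEADER_SIZE	20u
#define EXAMPLE_RECORD_SIZE	24u
#define EXAMPLE_MAX_TABLE_SIZE	(256u * 1024u)

struct example_table_header {
	uint32_t magic;		/* signature identifying a valid table */
	uint32_t tbl_size;	/* header plus records, in bytes */
};

/* Reject headers whose fields cannot describe a sane table. */
bool example_header_is_valid(const struct example_table_header *h)
{
	if (h->magic != EXAMPLE_TABLE_MAGIC)
		return false;
	if (h->tbl_size < EXAMPLE_HEADER_SIZE ||
	    h->tbl_size > EXAMPLE_MAX_TABLE_SIZE)
		return false;
	return true;
}

/* Derive a record count only from a validated header; 0 otherwise. */
uint32_t example_record_count(const struct example_table_header *h)
{
	if (!example_header_is_valid(h))
		return 0;
	return (h->tbl_size - EXAMPLE_HEADER_SIZE) / EXAMPLE_RECORD_SIZE;
}
```

Because the count is bounded by `EXAMPLE_MAX_TABLE_SIZE`, a corrupted size field can no longer translate into an arbitrarily large allocation request.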

drm/amdgpu: Reset RAS table if header is invalid

00a8c3d

If a valid header is not found during RAS eeprom init, consider it as new and reset RAS table info.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>

drm/amdgpu/gfx9: Add Cleaner Shader Support for GFX9.x GPUs

9269655

Enable the cleaner shader for other GFX9.x series of GPUs to provide data isolation between GPU workloads. The cleaner shader is responsible for clearing the Local Data Store (LDS), Vector General Purpose Registers (VGPRs), and Scalar General Purpose Registers (SGPRs), which helps prevent data leakage and ensures accurate computation results.
This update extends cleaner shader support to GFX9.x GPUs, previously available for GFX9.4.2. It enhances security by clearing GPU memory between processes and maintains a consistent GPU state across KGD and KFD workloads.
Cc: Manu Rastogi <manu.rastogi@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: adjust enforce_isolation handling

a711eb1

Switch from a bool to an enum and allow more options for enforce isolation. There are now 3 modes of operation:
- Disabled (0)
- Enabled (serialization and cleaner shader) (1)
- Enabled in legacy mode (no serialization or cleaner shader) (2)
This provides better flexibility for more use cases.
Acked-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
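The three modes listed in this commit message can be modelled with a small enum. This is a sketch under assumptions: the enum and function names are hypothetical rather than the driver's identifiers; only the numeric values 0, 1, and 2 and their meanings come from the message above.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical names; the values mirror the modes in the commit message. */
enum example_isolation_mode {
	EXAMPLE_ISOLATION_DISABLE = 0,
	EXAMPLE_ISOLATION_ENABLE = 1,		/* serialization and cleaner shader */
	EXAMPLE_ISOLATION_ENABLE_LEGACY = 2,	/* no serialization or cleaner shader */
};

/* Convert a raw parameter value into a mode, rejecting unknown values. */
int example_parse_isolation(int value, enum example_isolation_mode *out)
{
	switch (value) {
	case EXAMPLE_ISOLATION_DISABLE:
	case EXAMPLE_ISOLATION_ENABLE:
	case EXAMPLE_ISOLATION_ENABLE_LEGACY:
		*out = (enum example_isolation_mode)value;
		return 0;
	default:
		return -EINVAL;
	}
}

bool example_uses_serialization(enum example_isolation_mode mode)
{
	return mode == EXAMPLE_ISOLATION_ENABLE;
}

bool example_uses_cleaner_shader(enum example_isolation_mode mode)
{
	return mode == EXAMPLE_ISOLATION_ENABLE;
}
```

Moving from a bool to an enum keeps the old 0 and 1 values meaningful while leaving room for further modes.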

drm/amdgpu: Add Support for enforcing isolation without Cleaner Shader

4a1616a

Adjusted the enforce isolation setting handling to include the ability to disable the cleaner shader without affecting isolation between tasks.
v2: Updated enforce isolation documentation and parameters. (Alex)
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: Don't call mmput from MMU notifier callback

11ae9d1

If the process is exiting, the mmput inside mmu notifier callback from compactd or fork or numa balancing could release the last reference of mm struct to call exit_mmap and free_pgtable, this triggers deadlock with below backtrace.
The deadlock will leak kfd process as mmu notifier release is not called and cause VRAM leaking.
The fix is to take mm reference mmget_not_zero when adding prange to the deferred list to pair with mmput in deferred list work.
If prange split and add into pchild list, the pchild work_item.mm is not used, so remove the mm parameter from svm_range_unmap_split and svm_range_add_child.
The backtrace of hung task:
 INFO: task python:348105 blocked for more than 64512 seconds.
 Call Trace:
  __schedule+0x1c3/0x550
  schedule+0x46/0xb0
  rwsem_down_write_slowpath+0x24b/0x4c0
  unlink_anon_vmas+0xb1/0x1c0
  free_pgtables+0xa9/0x130
  exit_mmap+0xbc/0x1a0
  mmput+0x5a/0x140
  svm_range_cpu_invalidate_pagetables+0x2b/0x40 [amdgpu]
  mn_itree_invalidate+0x72/0xc0
  __mmu_notifier_invalidate_range_start+0x48/0x60
  try_to_unmap_one+0x10fa/0x1400
  rmap_walk_anon+0x196/0x460
  try_to_unmap+0xbb/0x210
  migrate_page_unmap+0x54d/0x7e0
  migrate_pages_batch+0x1c3/0xae0
  migrate_pages_sync+0x98/0x240
  migrate_pages+0x25c/0x520
  compact_zone+0x29d/0x590
  compact_zone_order+0xb6/0xf0
  try_to_compact_pages+0xbe/0x220
  __alloc_pages_direct_compact+0x96/0x1a0
  __alloc_pages_slowpath+0x410/0x930
  __alloc_pages_nodemask+0x3a9/0x3e0
  do_huge_pmd_anonymous_page+0xd7/0x3e0
  __handle_mm_fault+0x5e3/0x5f0
  handle_mm_fault+0xf7/0x2e0
  hmm_vma_fault.isra.0+0x4d/0xa0
  walk_pmd_range.isra.0+0xa8/0x310
  walk_pud_range+0x167/0x240
  walk_pgd_range+0x55/0x100
  __walk_page_range+0x87/0x90
  walk_page_range+0xf6/0x160
  hmm_range_fault+0x4f/0x90
  amdgpu_hmm_range_get_pages+0x123/0x230 [amdgpu]
  amdgpu_ttm_tt_get_user_pages+0xb1/0x150 [amdgpu]
  init_user_pages+0xb1/0x2a0 [amdgpu]
  amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x543/0x7d0 [amdgpu]
  kfd_ioctl_alloc_memory_of_gpu+0x24c/0x4e0 [amdgpu]
  kfd_ioctl+0x29d/0x500 [amdgpu]
Fixes: fa582c6 ("drm/amdkfd: Use mmget_not_zero in MMU notifier")
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
(cherry picked from commit 382c280)
Change-Id: Iec202cbf4ad7085cfca1083474e136dc0a822cff

drm/amdgpu: Set HDP_MMHUB_RO_OVERRIDE

f7e4c27

Set HDP_MMHUB_CNTL.HDP_MMHUB_RO_OVERRIDE = 0x0 for gfx943 dGPU. This is needed for enhanced RCCL performance
v2: Set the register only if not already set
Change-Id: Ifee2fe308bdb9ce4d8b2c613cc13a09c429d0e7d
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
(cherry picked from commit 2ab8116dfd6351952a74756aba2f9e10a9e26543)

drm/amdgpu: Suspend IH during mode-2 reset

3df32ae

On multi-aid SOCs, there could be a continuous stream of interrupts from GC after poison consumption. Suspend IH to disable them before doing mode-2 reset. This avoids conflicts in hardware accesses during interrupt handlers while a reset is ongoing.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>

drm/amdgpu: Clear reset flags from ras context

17e8760

Once RAS errors are cleared with appropriate recovery mechanism, clear reset flags also from RAS context. Otherwise, stale flag values could affect the subsequent RAS reset handling on the device.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
(cherry picked from commit 2ed5d493e5ae6b86988e70fccbd9554330762080)
Change-Id: I106126dc01e203419a4d59e8100e4b249d9fff41

drm/amdgpu: Propage amd_acquire errors in rdma_get_pages

0e43783

amd_acquire returns 1 on success and 0 on failure. rdma_get_pages needs to return non-zero if amd_acquire fails.
Originally found by Chuck Fossen from HPE
Signed-off-by: Matt Ezell <ezellma@ornl.gov>

mattaezell changed the title ~~drm/amdgpu: Propage amd_acquire errors in rdma_get_pages~~ drm/amdgpu: Propagate amd_acquire errors in rdma_get_pages

Aug 1, 2025

Contributor

kentrussell commented Aug 1, 2025

Thanks Matt (and Chuck by extension). I've pulled this internally to the dkms-branch-review process. I'll leave this open until it closes off (and will share any review comments in the interim, should they arise). I'll have an idea on the timeline for which ROCm release will have it once we merge it.
Is this a deal breaker right now (i.e. breaking software) or is it mostly just to help with some frustrating debugs?

Contributor Author

mattaezell commented Aug 1, 2025

Thanks!

> Is this a deal breaker right now (i.e. breaking software) or is it mostly just to help with some frustrating debugs?

We are actually getting kernel panics in the Cassini driver on Frontier that Chuck thinks are likely due to this. I haven't tracked the whole process, but because of this bug, Cassini thinks it's getting device memory from amdgpu when it isn't.

The code team running the app that triggers this found a bug in their code that was sending the driver down this path. They have fixed that in the interim and panics have stopped, so it's not an emergency at the moment.