
drm/amdgpu: Propagate amd_acquire errors in rdma_get_pages #194


Open
mattaezell wants to merge 2,290 commits into ROCm:master from mattaezell:amdgpu_rdma_get_pages_amd_acquire

Conversation

@mattaezell
Contributor

amd_acquire returns 1 on success and 0 on failure. rdma_get_pages needs to return non-zero if amd_acquire fails.

Originally found by Chuck Fossen from HPE
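
The return-convention mismatch described above can be sketched in user-space C. This is only an illustration under stated assumptions, not the amdgpu code: `fake_amd_acquire`, `get_pages_unchecked`, and `get_pages_checked` are hypothetical stand-ins, and `-EINVAL` is an assumed error code that the PR text does not specify.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for amd_acquire(): 1 on success, 0 on failure. */
static int fake_amd_acquire(int is_gpu_address)
{
    return is_gpu_address ? 1 : 0;
}

/* Hypothetical caller that drops the acquire result and always reports
 * success (0), so a failed acquire is invisible to its own caller. */
static int get_pages_unchecked(int is_gpu_address)
{
    (void)fake_amd_acquire(is_gpu_address);
    return 0;
}

/* Same caller, but a failed acquire (0) is turned into a non-zero error.
 * -EINVAL is an assumed error code, not taken from the patch. */
static int get_pages_checked(int is_gpu_address)
{
    if (!fake_amd_acquire(is_gpu_address))
        return -EINVAL;
    return 0;
}

/* Usage: only the checked variant reports the failed acquire. */
void demo_get_pages(void)
{
    assert(get_pages_checked(1) == 0);
    assert(get_pages_checked(0) != 0);
    assert(get_pages_unchecked(0) == 0); /* failure silently hidden */
}
```

The point of the sketch is that the two functions use opposite conventions (1 means success for the acquire step, 0 means success for the page lookup), so the result has to be translated rather than passed through or ignored.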

pldrc and others added 30 commits February 5, 2025 16:23
Add register list and enable devcoredump for JPEG4_0_3
V2: (Lijo) - remove version specific callbacks and use simplified helper functions
V3: (Lijo) - move amdgpu_jpeg_reg_dump_fini() to sw_fini() and avoid the call here
Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>

Add register list and enable devcoredump for JPEG5_0_1
V2: (Lijo) - remove version specific callbacks and use simplified helper functions
V3: (Lijo) - move amdgpu_jpeg_reg_dump_fini() to sw_fini() and avoid the call here
Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>

Add register list and enable devcoredump for JPEG4_0_0
V2: (Lijo) - remove version specific callbacks and use simplified helper functions
V3: (Lijo) - move amdgpu_jpeg_reg_dump_fini() to sw_fini() and avoid the call here
Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>

Add register list and enable devcoredump for JPEG4_0_5
V2: (Lijo) - remove version specific callbacks and use simplified helper functions
V3: (Lijo) - move amdgpu_jpeg_reg_dump_fini() to sw_fini() and avoid the call here
Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>

Add register list and enable devcoredump for JPEG3_0_0
V2: (Lijo) - remove version specific callbacks and use simplified helper functions
V3: (Lijo) - move amdgpu_jpeg_reg_dump_fini() to sw_fini() and avoid the call here
Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>

Add register list and enable devcoredump for JPEG2_0_0
V2: (Lijo) - remove version specific callbacks and use simplified helper functions
V3: (Lijo) - move amdgpu_jpeg_reg_dump_fini() to sw_fini() and avoid the call here
Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>

Add register list and enable devcoredump for JPEG2_5_0
V2: (Lijo) - remove version specific callbacks and use simplified helper functions
V3: (Lijo) - move amdgpu_jpeg_reg_dump_fini() to sw_fini() and avoid the call here
Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>

Add register list and enable devcoredump for JPEG5_0_0
V2: (Lijo) - remove version specific callbacks and use simplified helper functions
V3: (Lijo) - move amdgpu_jpeg_reg_dump_fini() to sw_fini() and avoid the call here
Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>

JPEG 5.0.1 supports up to 10 rings, however PMFW support for SMU v13.0.6 variants is now limited to 8 per instance. Limit to 8 temporarily to avoid out of bounds access.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>

Currently, there are several files in drm/amd/display that aim to have a higher -Wframe-larger-than value to avoid instances of that warning with a lower value from the user's configuration. However, with the way that it is currently implemented, it does not respect the user's request via CONFIG_FRAME_WARN for a higher stack frame limit, which can cause pain when new instances of the warning appear and break the build due to CONFIG_WERROR.
Adjust the logic to switch from a hard coded -Wframe-larger-than value to only using the value as a minimum clamp and deferring to the requested value from CONFIG_FRAME_WARN if it is higher.
Suggested-by: Harry Wentland <harry.wentland@amd.com>
Reported-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Closes: https://lore.kernel.org/2025013003-audience-opposing-7f95@gregkh/
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

mpc1_is_mpcc_idle() was added in 2017 by commit feb4a3c ("drm/amd/display: Integrating MPC pseudocode") but never used.
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

mod_freesync_get_vmin_vmax() and mod_freesync_get_v_position() were added in 2017 by commit 72ada5f ("drm/amd/display: FreeSync Auto Sweep Support")
mod_freesync_is_valid_range() was added in 2018 by commit e80e944 ("drm/amd/display: add method to check for supported range")
mod_freesync_get_settings() was added in 2018 by commit a3e1737 ("drm/amd/display: Implement stats logging")
and mod_freesync_calc_field_rate_from_timing() was added in 2020 by commit 49c70ec ("drm/amd/display: Change input parameter for set_drr")
None of these have been used.
Remove them.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

The last user of dc_stream_get_crtc_position() was mod_freesync_get_v_position() which is removed in a previous patch in this series.
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

get_clock_requirements_for_state() was added in 2018 by commit 8ab2180 ("drm/amd/display: Add function to fetch clock requirements") but never used.
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

hubbub1_toggle_watermark_change_req() last use was removed in 2017 by commit b8fce2c ("drm/amd/display: Optimize programming front end")
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

get_max_support_fbc_buffersize() is unused since 2021's commit 94f0d0c ("drm/amd/display/dc/dce110/dce110_compressor: Remove unused function 'dce110_get_required_compressed_surfacesize") removed its only caller.
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

link_enc_cfg_get_link_enc_used_by_stream() is no longer used after 2021's commit 6366b00 ("drm/amd/display: Maintain consistent mode of operation during encoder assignment") which introduces and uses the _current version instead.
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

This reverts commit 2a69ae1e1354 ("drm/amd/display: Use HW lock mgr for PSR1")
Because it may cause system hang while connect with two edp panel.
Acked-by: Wayne Lin <wayne.lin@amd.com>
Signed-off-by: Tom Chung <chiahsuan.chung@amd.com>

It is possible for some waves in a workgroup to finish their save sequence before the group leader has had time to capture the workgroup barrier state. When this happens, having those waves exit does impact the barrier state. As a consequence, the state captured by the group leader is invalid, and is eventually incorrectly restored.
This patch proposes to have all waves in a workgroup wait for each other at the end of their save sequence (just before calling s_endpgm_saved).
Signed-off-by: Lancelot SIX <lancelot.six@amd.com>
Reviewed-by: Jay Cornwall <jay.cornwall@amd.com>

Use DC_LOG_DEBUG instead of pr_info to match other uses in dc.c.
Fixes: eb8eec752038 ("drm/amd/display: Add debug messages for dc_validate_boot_timing()")
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Hung <alex.hung@amd.com>

SDMA 4.x is not part of the GFX power domain so this is not necessary.
Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

atom bios header files are not required in these files.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

Wrap the checks before device access in helper functions and use them for device access. The generic order of APIs now is to do input argument validation first and check if device access is allowed.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Feifei Xu <feifei.xu@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>

If a device supports runtime pm, then pm_runtime_get_if_active returns 0 if a device is not active and 1 if already active. However, if a device doesn't support runtime pm, the API returns -EINVAL. A device not supporting runtime pm implies it's not affected by runtime pm and it's active. Hence no need to get() to increment usage count. Remove < 0 return value check. Also, ignore runpm state to determine active status.
If the device is already in suspend state, disallow access.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Feifei Xu <feifei.xu@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>

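The return-value handling described in the commit message above could be sketched roughly as follows. This is a user-space illustration only: `struct access_decision` and `decide_access` are hypothetical names, `ret` is assumed to be the value returned by a `pm_runtime_get_if_active()`-style call, and the separate "already suspended" check mentioned in the message is not modeled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical result type for this sketch. */
struct access_decision {
    bool allowed;   /* may the caller touch the device? */
    bool holds_ref; /* must the caller drop a usage-count reference later? */
};

/* Interpret the three cases described in the commit message. */
static struct access_decision decide_access(int ret)
{
    struct access_decision d = { false, false };

    if (ret == 0)          /* runtime PM supported, device not active */
        return d;

    d.allowed = true;
    d.holds_ref = ret > 0; /* 1: device active and a reference was taken */
    return d;              /* ret < 0 (e.g. -EINVAL): no runtime PM, no reference */
}

/* Usage: a negative return is treated as "active, nothing to put back". */
void demo_decide_access(void)
{
    assert(!decide_access(0).allowed);
    assert(decide_access(1).allowed && decide_access(1).holds_ref);
    assert(decide_access(-EINVAL).allowed && !decide_access(-EINVAL).holds_ref);
}
```

Tracking whether a reference is held matters because only the `ret > 0` case needs a matching put afterwards.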
For amdgpu_get_pp_force_state, amdgpu_get_pp_cur_state already takes care of device state check. In other cases, values are returned from driver cached variables and are not dependent on device state.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Feifei Xu <feifei.xu@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>

These were missed in the cleanup patch, so remove them to allow the kernel to compile again
Fixes: 083e622 ("drm/amdkcl: cleanup macro AMDKCL_AMDGPU_DMABUF_OPS")
Signed-off-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Bob Zhou <Bob.Zhou@amd.com>

MES requires driver set cleaner_shader_fence_mc_addr for cleaner shader support.
Signed-off-by: Shaoyun Liu <shaoyun.liu@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Change-Id: Ie7a20254683948735c6c3b9c20e6f0f842ab0720

When mesa started using compute queues more often we started seeing additional hangs with compute queues. Disabling gfxoff seems to mitigate that. Manually control gfxoff and gfx pg with command submissions to avoid any issues related to gfxoff. KFD already does the same thing for these chips.
v2: limit to compute
v3: limit to APUs
v4: limit to Raven/PCO
v5: only update the compute ring_funcs
v6: Disable GFX PG
v7: adjust order
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Suggested-by: Błażej Szczygieł <mumei6102@gmail.com>
Suggested-by: Sergey Kovalenko <seryoga.engineering@gmail.com>
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3861
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-January/119116.html
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Bump the driver version for RV/PCO compute stability fix so mesa can use this check to enable compute queues on RV/PCO.
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Same as amdgpu_gfx_off_ctrl(), but without the delay for gfxoff disallow.
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Suggested-by: Błażej Szczygieł <mumei6102@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Yifan Zha and others added 24 commits May 27, 2025 09:37
[Why]
Register access print missed device info.
[How]
Using dev_xxx instead of DRM_xxx to indicate which device of a hive is the message for.
Signed-off-by: Yifan Zha <Yifan.Zha@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

This patch updates the cleaner shader, which is responsible for initializing GPU resources such as Local Data Share (LDS), Vector General Purpose Registers (VGPRs), and Scalar General Purpose Registers (SGPRs). Changes include adjustments to register clearing and shader configuration.
- Updated GPU resource initialization addresses in the cleaner shader from `be803080` to `be803000`.
- Simplified the logic in the SGPR clearing section, ensuring all SGPRs are set to zero.
Fixes: 25961ba ("drm/amdgpu/gfx10: Add cleaner shader for GFX10.1.10")
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Manu Rastogi <manu.rastogi@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>

Firmware algorithm changed and the values in this version are not accurate thereby remove host limit metric support for smu_v13_0_6, smu_v13_0_12 & smu_v13_0_14
Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>

Update smu metrics table to version 0x10 for smu_v13_0_6
v2: Host metrics support removal moved to separate patch (Lijo)
Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>

Add pldm version reporting through sysfs node
Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

Update pmfw headers for smu_v_13_0_6 to include pldm version as part of statics metrics table
Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

Fetch pldm version from static metrics table for SMU v13.0.6 SOCs
Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

Enable static metrics support to fetch board voltage and pldm version for other smu_v13_0_6 program
Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

Enable static metrics support to fetch board voltage and pldm version for smu_v13_0_14
Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

KIQ invalidate_tlbs request has been seen to marginally exceed the configured 100 ms timeout on systems under load.
All other KIQ requests in the driver use a 10 second timeout. Use a similar timeout implementation on the invalidate_tlbs path.
v2: Poll once before msleep
v3: Fix return value
Signed-off-by: Jay Cornwall <jay.cornwall@amd.com>
Cc: Kent Russell <kent.russell@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
(cherry picked from commit efc206db3bcf6a5ce4dc3ecb97aba4b4548ffcc2)
Change-Id: I932ad73e0a6bc71d4a917ab9ed69f1f071f29fef

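The "poll once before msleep" pattern mentioned in the commit message above can be illustrated with a small user-space sketch. Nothing here is the driver's code: `struct fake_request`, `fake_done`, `fake_sleep`, and `wait_for_request` are hypothetical, the sleep is only counted rather than performed, and `-ETIMEDOUT` is an assumed return value (the message does not say which value v3 settled on).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical request that completes on its Nth poll. */
struct fake_request {
    int polls_needed;
    int polls;
    int sleeps;
};

static bool fake_done(struct fake_request *req)
{
    return ++req->polls >= req->polls_needed;
}

/* Stand-in for a real sleep; it only counts how often we would sleep. */
static void fake_sleep(struct fake_request *req)
{
    req->sleeps++;
}

/* Poll first, and sleep only if the request is still pending. */
static int wait_for_request(struct fake_request *req,
                            unsigned timeout_ms, unsigned interval_ms)
{
    unsigned waited_ms = 0;

    assert(interval_ms > 0); /* otherwise the loop could never time out */
    for (;;) {
        if (fake_done(req))
            return 0;              /* fast path: no sleep at all */
        if (waited_ms >= timeout_ms)
            return -ETIMEDOUT;     /* assumed error code */
        fake_sleep(req);
        waited_ms += interval_ms;
    }
}
```

With this ordering a request that is already complete returns without sleeping, and a request that never completes is polled one last time after the full timeout has elapsed.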
To unmap and free seq64 memory when drm node close to free vm, if there is signal accepted, then taking vm lock failed and leaking seq64 va mapping, and then dmesg has error log "still active bo inside vm".
Change to use uninterruptible lock fix the mapping leaking and no dmesg error log.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>

When an entity from application B is killed, drm_sched_entity_kill() removes all jobs belonging to that entity through drm_sched_entity_kill_jobs_work(). If application A's job depends on a scheduled fence from application B's job, and that fence is not properly signaled during the killing process, application A's dependency cannot be cleared.
This leads to application A hanging indefinitely while waiting for a dependency that will never be resolved. Fix this issue by ensuring that scheduled fences are properly signaled when an entity is killed, allowing dependent applications to continue execution.
Signed-off-by: Lin.Cao <lincao12@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Philipp Stanner <phasta@kernel.org>

1. add kicker device list
2. add kicker device checking helper function
Signed-off-by: Frank Min <Frank.Min@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

1. Add kicker firmwares loading for gfx11/smu13/psp13
2. Register additional MODULE_FIRMWARE entries for kicker fws
   - gc_11_0_0_rlc_kicker.bin
   - gc_11_0_0_imu_kicker.bin
   - psp_13_0_0_sos_kicker.bin
   - psp_13_0_0_ta_kicker.bin
   - smu_13_0_0_kicker.bin
Signed-off-by: Frank Min <Frank.Min@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

If RAS header read from EEPROM is corrupted, it could result in trying to allocate huge memory for reading the records. Add some validation to header fields.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

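The idea of validating header fields before sizing an allocation, as in the commit message above, might look like the sketch below. Every name, field, and constant is an assumption made for illustration (`struct table_header`, `TABLE_MAGIC`, the byte sizes, and the upper bound); the actual RAS EEPROM layout and checks are not given in this page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* All constants below are hypothetical values chosen for the sketch. */
#define TABLE_MAGIC     0x54424c31u      /* arbitrary marker */
#define HEADER_BYTES    20u
#define RECORD_BYTES    24u
#define MAX_TABLE_BYTES (256u * 1024u)   /* assumed device capacity */

/* Hypothetical on-device table header. */
struct table_header {
    uint32_t magic;
    uint32_t first_rec_offset; /* byte offset of the first record */
    uint32_t tbl_size;         /* total table size in bytes */
};

/* Returns true and stores the record count only if every field is sane,
 * so a corrupted size can never drive a huge allocation. */
static bool header_record_count(const struct table_header *h, uint32_t *count)
{
    assert(h != NULL && count != NULL);

    if (h->magic != TABLE_MAGIC)
        return false;
    if (h->tbl_size < HEADER_BYTES || h->tbl_size > MAX_TABLE_BYTES)
        return false;
    if (h->first_rec_offset < HEADER_BYTES || h->first_rec_offset > h->tbl_size)
        return false;

    *count = (h->tbl_size - h->first_rec_offset) / RECORD_BYTES;
    return true;
}
```

A caller would allocate `count * RECORD_BYTES` only after this function returns true, which bounds the allocation by `MAX_TABLE_BYTES`.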
If a valid header is not found during RAS eeprom init, consider it as new and reset RAS table info.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>

Enable the cleaner shader for other GFX9.x series of GPUs to provide data isolation between GPU workloads. The cleaner shader is responsible for clearing the Local Data Store (LDS), Vector General Purpose Registers (VGPRs), and Scalar General Purpose Registers (SGPRs), which helps prevent data leakage and ensures accurate computation results.
This update extends cleaner shader support to GFX9.x GPUs, previously available for GFX9.4.2. It enhances security by clearing GPU memory between processes and maintains a consistent GPU state across KGD and KFD workloads.
Cc: Manu Rastogi <manu.rastogi@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>

Switch from a bool to an enum and allow more options for enforce isolation. There are now 3 modes of operation:
- Disabled (0)
- Enabled (serialization and cleaner shader) (1)
- Enabled in legacy mode (no serialization or cleaner shader) (2)
This provides better flexibility for more use cases.
Acked-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

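The three modes listed in the commit message above map naturally onto an enum. The sketch below uses the numeric values from the message, but the identifier names and helper functions are hypothetical and are not the driver's actual definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Values follow the commit message; the names are assumptions. */
enum isolation_mode {
    ISOLATION_DISABLED = 0,
    ISOLATION_ENABLED = 1,        /* serialization and cleaner shader */
    ISOLATION_ENABLED_LEGACY = 2, /* no serialization or cleaner shader */
};

/* Convert a raw parameter value, rejecting anything outside 0..2. */
static bool isolation_mode_from_param(int value, enum isolation_mode *mode)
{
    assert(mode != 0);
    switch (value) {
    case ISOLATION_DISABLED:
    case ISOLATION_ENABLED:
    case ISOLATION_ENABLED_LEGACY:
        *mode = (enum isolation_mode)value;
        return true;
    default:
        return false;
    }
}

/* Isolation is on in both enabled modes. */
static bool mode_enforces_isolation(enum isolation_mode mode)
{
    return mode != ISOLATION_DISABLED;
}

/* Only the non-legacy enabled mode serializes and runs the cleaner shader. */
static bool mode_uses_cleaner_shader(enum isolation_mode mode)
{
    return mode == ISOLATION_ENABLED;
}
```

Compared with a bool, the enum lets the legacy behaviour stay selectable while the default enabled mode gains serialization and the cleaner shader.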
Adjusted the enforce isolation setting handling to include the ability to disable the cleaner shader without affecting isolation between tasks.
v2: Updated enforce isolation documentation and parameters. (Alex)
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

If the process is exiting, the mmput inside mmu notifier callback from compactd or fork or numa balancing could release the last reference of mm struct to call exit_mmap and free_pgtable, this triggers deadlock with below backtrace.
The deadlock will leak kfd process as mmu notifier release is not called and cause VRAM leaking.
The fix is to take mm reference mmget_non_zero when adding prange to the deferred list to pair with mmput in deferred list work.
If prange split and add into pchild list, the pchild work_item.mm is not used, so remove the mm parameter from svm_range_unmap_split and svm_range_add_child.
The backtrace of hung task:
 INFO: task python:348105 blocked for more than 64512 seconds.
 Call Trace:
  __schedule+0x1c3/0x550
  schedule+0x46/0xb0
  rwsem_down_write_slowpath+0x24b/0x4c0
  unlink_anon_vmas+0xb1/0x1c0
  free_pgtables+0xa9/0x130
  exit_mmap+0xbc/0x1a0
  mmput+0x5a/0x140
  svm_range_cpu_invalidate_pagetables+0x2b/0x40 [amdgpu]
  mn_itree_invalidate+0x72/0xc0
  __mmu_notifier_invalidate_range_start+0x48/0x60
  try_to_unmap_one+0x10fa/0x1400
  rmap_walk_anon+0x196/0x460
  try_to_unmap+0xbb/0x210
  migrate_page_unmap+0x54d/0x7e0
  migrate_pages_batch+0x1c3/0xae0
  migrate_pages_sync+0x98/0x240
  migrate_pages+0x25c/0x520
  compact_zone+0x29d/0x590
  compact_zone_order+0xb6/0xf0
  try_to_compact_pages+0xbe/0x220
  __alloc_pages_direct_compact+0x96/0x1a0
  __alloc_pages_slowpath+0x410/0x930
  __alloc_pages_nodemask+0x3a9/0x3e0
  do_huge_pmd_anonymous_page+0xd7/0x3e0
  __handle_mm_fault+0x5e3/0x5f0
  handle_mm_fault+0xf7/0x2e0
  hmm_vma_fault.isra.0+0x4d/0xa0
  walk_pmd_range.isra.0+0xa8/0x310
  walk_pud_range+0x167/0x240
  walk_pgd_range+0x55/0x100
  __walk_page_range+0x87/0x90
  walk_page_range+0xf6/0x160
  hmm_range_fault+0x4f/0x90
  amdgpu_hmm_range_get_pages+0x123/0x230 [amdgpu]
  amdgpu_ttm_tt_get_user_pages+0xb1/0x150 [amdgpu]
  init_user_pages+0xb1/0x2a0 [amdgpu]
  amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x543/0x7d0 [amdgpu]
  kfd_ioctl_alloc_memory_of_gpu+0x24c/0x4e0 [amdgpu]
  kfd_ioctl+0x29d/0x500 [amdgpu]
Fixes: fa582c6 ("drm/amdkfd: Use mmget_not_zero in MMU notifier")
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
(cherry picked from commit 382c280)
Change-Id: Iec202cbf4ad7085cfca1083474e136dc0a822cff

Set HDP_MMHUB_CNTL.HDP_MMHUB_RO_OVERRIDE = 0x0 for gfx943 dGPU. This is needed for enhanced RCCL performance
v2: Set the register only if not already set
Change-Id: Ifee2fe308bdb9ce4d8b2c613cc13a09c429d0e7d
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
(cherry picked from commit 2ab8116dfd6351952a74756aba2f9e10a9e26543)

On multi-aid SOCs, there could be a continuous stream of interrupts from GC after poison consumption. Suspend IH to disable them before doing mode-2 reset. This avoids conflicts in hardware accesses during interrupt handlers while a reset is ongoing.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>

Once RAS errors are cleared with appropriate recovery mechanism, clear reset flags also from RAS context. Otherwise, stale flag values could affect the subsequent RAS reset handling on the device.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
(cherry picked from commit 2ed5d493e5ae6b86988e70fccbd9554330762080)
Change-Id: I106126dc01e203419a4d59e8100e4b249d9fff41

amd_acquire returns 1 on success and 0 on failure. rdma_get_pages needs to return non-zero if amd_acquire fails.
Originally found by Chuck Fossen from HPE
Signed-off-by: Matt Ezell <ezellma@ornl.gov>

mattaezell changed the title from "drm/amdgpu: Propage amd_acquire errors in rdma_get_pages" to "drm/amdgpu: Propagate amd_acquire errors in rdma_get_pages" on Aug 1, 2025
@kentrussell
Contributor

Thanks Matt (and Chuck by extension). I've pulled this internally to the dkms-branch-review process. I'll leave this open until it closes off (and will share any review comments in the interim, should they arise). I'll have an idea on the timeline for which ROCm release will have it once we merge it.
Is this a deal breaker right now (i.e. breaking software), or is it mostly just to help with some frustrating debugs?

@mattaezell
Contributor, Author

Thanks!

Is this a deal breaker right now (i.e. breaking software), or is it mostly just to help with some frustrating debugs?

We are actually getting kernel panics in the cassini driver on Frontier that Chuck thinks are likely due to this. I haven't tracked the whole process, but because of this, Cassini thinks it's getting device memory from amdgpu when it isn't.

The code team running the app that triggers this found a bug in their code that was sending the driver down this path. They have fixed that in the interim and panics have stopped, so it's not an emergency at the moment.

@kentrussell
Contributor

So it's not in ROCm 7.0.* or older releases yet. It's going to be out in ROCm 7.1. At long last. Keeping it open until then, though.


30 participants

@mattaezell @kentrussell @pldrc @nathanchance @lancesix @alexdeucher @jiangliu @superm1 @candicelicy @PhilipYangA @hkasivis @srishanm @dayatsin-amd @jcornwallAMD @xiaogang-chen-amd @EmilyDeng666 @andmar-amd @fcui-amd @amd-rfechney @Lawstorant @tony-amd @vskvorts @ArvindYadavAMD @AMD-ShaneXiao @ChristianKoenigAMD @charliu-AMDENG @AMD-aric @WangSungHuai @vprosyak @frank98753
