Xe Device Coredump
Xe uses dev_coredump infrastructure for exposing the crash errors in a standardized way. Once a crash occurs, devcoredump exposes a temporary node under /sys/class/devcoredump/devcd<m>/. The same node is also accessible in /sys/class/drm/card<n>/device/devcoredump/. The failing_device symlink points to the device that crashed and created the coredump.
The following characteristics are observed by xe when creating a device coredump:
- Snapshot at hang:
The ‘data’ file contains a snapshot of the HW and driver states at the time the hang happened. Due to the driver recovering from resets/crashes, it may not correspond to the state of the system when the file is read by userspace.
- Coredump release:
After a coredump is generated, it stays in kernel memory until released by userspace by writing anything to it, or after an internal timer expires. The exact timeout may vary and should not be relied upon. Example to release a coredump:
$ > /sys/class/drm/card0/device/devcoredump/data
- First failure only:
In general, the first hang is the most critical one since the following hangs can be a consequence of the initial hang. For this reason a snapshot is taken only for the first failure. Until the devcoredump is released by userspace or kernel, all subsequent hangs do not override the snapshot nor create new ones. Devcoredump has a delayed work queue that will eventually delete the file node and free all the dump information.
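The snapshot and release behaviour described above can be exercised from userspace with ordinary file I/O. The following is a minimal sketch, not part of the driver: the helper names `save_coredump()` and `release_coredump()` are hypothetical, and the caller is assumed to pass the path of the `data` node (for example the `card0` path used above). It copies the snapshot to a regular file first, then writes a single byte to the node to release it.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy the snapshot at data_path into out_path.
 * Returns the number of bytes copied, or -1 on error.
 */
long save_coredump(const char *data_path, const char *out_path)
{
	char buf[4096];
	long total = 0;
	ssize_t n;
	int in_fd, out_fd;

	in_fd = open(data_path, O_RDONLY);
	if (in_fd < 0)
		return -1;

	out_fd = open(out_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (out_fd < 0) {
		close(in_fd);
		return -1;
	}

	while ((n = read(in_fd, buf, sizeof(buf))) > 0) {
		ssize_t done = 0;

		/* write() may be partial, so loop until the block is flushed */
		while (done < n) {
			ssize_t w = write(out_fd, buf + done, (size_t)(n - done));

			if (w < 0) {
				total = -1;
				goto cleanup;
			}
			done += w;
		}
		total += n;
	}
	if (n < 0)
		total = -1;

cleanup:
	close(out_fd);
	close(in_fd);
	return total;
}

/*
 * Hypothetical helper: release the coredump by writing one byte to the
 * data node. The content of the write is irrelevant. Returns 0 on success.
 */
int release_coredump(const char *data_path)
{
	int fd = open(data_path, O_WRONLY);
	int ret;

	if (fd < 0)
		return -1;
	ret = (write(fd, "1", 1) == 1) ? 0 : -1;
	close(fd);
	return ret;
}
```

Saving before releasing matters because the node disappears once the coredump is released or the internal timer expires, so a tool built along these lines should copy the data as soon as the node shows up.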
Internal API
- ssize_t xe_devcoredump_read(char *buffer, loff_t offset, size_t count, void *data, size_t datalen)
Read data from the Xe device coredump snapshot
Parameters
char *buffer: Destination buffer to copy the coredump data into
loff_t offset: Offset in the coredump data to start reading from
size_t count: Number of bytes to read
void *data: Pointer to the xe_devcoredump structure
size_t datalen: Length of the data (unused)
Description
Reads a chunk of the coredump snapshot data into the provided buffer. If the devcoredump is smaller than 1.5 GB (XE_DEVCOREDUMP_CHUNK_MAX), it is read directly from a pre-written buffer. For larger devcoredumps, the pre-written buffer must be periodically repopulated from the snapshot state due to kmalloc size limitations.
Return
Number of bytes copied on success, or a negative error code on failure.
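The chunked read described above can be modelled in userspace. The sketch below is a simplified illustration under stated assumptions, not the driver's implementation: all `toy_` names are hypothetical, `TOY_CHUNK_MAX` is a tiny stand-in for XE_DEVCOREDUMP_CHUNK_MAX, and a plain string stands in for the snapshot state from which the buffer is repopulated.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define TOY_CHUNK_MAX 8	/* tiny stand-in for XE_DEVCOREDUMP_CHUNK_MAX */

struct toy_coredump {
	const char *snapshot;		/* stand-in for the snapshot state */
	size_t size;			/* total size of the rendered dump */
	char chunk[TOY_CHUNK_MAX];	/* the pre-written buffer */
	size_t chunk_pos;		/* dump offset held in chunk[0] */
	unsigned int refills;		/* repopulations after init */
};

/* Repopulate the buffer with the dump bytes starting at pos. */
void toy_fill_chunk(struct toy_coredump *cd, size_t pos)
{
	size_t len = cd->size - pos;

	if (len > TOY_CHUNK_MAX)
		len = TOY_CHUNK_MAX;
	memcpy(cd->chunk, cd->snapshot + pos, len);
	cd->chunk_pos = pos;
	cd->refills++;
}

void toy_coredump_init(struct toy_coredump *cd, const char *snapshot)
{
	cd->snapshot = snapshot;
	cd->size = strlen(snapshot);
	toy_fill_chunk(cd, 0);
	cd->refills = 0;	/* the initial fill is not counted */
}

/* Copy up to count bytes from dump offset into buffer. */
ssize_t toy_coredump_read(struct toy_coredump *cd, char *buffer,
			  size_t offset, size_t count)
{
	size_t copied = 0;

	if (offset >= cd->size)
		return 0;
	if (count > cd->size - offset)
		count = cd->size - offset;

	while (copied < count) {
		size_t pos = offset + copied;
		size_t chunk_start = pos - pos % TOY_CHUNK_MAX;
		size_t in_chunk = pos - chunk_start;
		size_t n = TOY_CHUNK_MAX - in_chunk;

		/* Only repopulate when the read leaves the current chunk */
		if (chunk_start != cd->chunk_pos)
			toy_fill_chunk(cd, chunk_start);
		if (n > count - copied)
			n = count - copied;
		memcpy(buffer + copied, cd->chunk + in_chunk, n);
		copied += n;
	}
	return (ssize_t)copied;
}
```

In this model a dump that fits in one chunk never triggers a refill, which mirrors the direct-read case for dumps below the chunk limit. Sequential reads of a larger dump refill once per chunk boundary crossed.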
- void xe_devcoredump(struct xe_exec_queue *q, struct xe_sched_job *job, const char *fmt, ...)
Take the required snapshots and initialize coredump device.
Parameters
struct xe_exec_queue *q: The faulty xe_exec_queue, where the issue was detected.
struct xe_sched_job *job: The faulty xe_sched_job, where the issue was detected.
const char *fmt: Printf format + args to describe the reason for the core dump
...: variable arguments
Description
This function should be called at the crash time within the serialized gt_reset. It is skipped if we still have the core dump device available with the information of the ‘first’ snapshot.
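The skip-if-already-captured behaviour, together with the printf-style reason string, can be illustrated with a small userspace model. This is a sketch under assumptions, not the driver's code: `struct toy_devcoredump` and both functions are hypothetical, and the capture function returns a bool purely so the skip is observable, whereas xe_devcoredump() itself returns void.

```c
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>

struct toy_devcoredump {
	bool captured;		/* true while a snapshot is being held */
	char reason[128];	/* formatted reason of the first failure */
};

/*
 * Record the failure reason unless a snapshot is already held.
 * Returns true if a new snapshot was taken, false if it was skipped.
 */
bool toy_devcoredump(struct toy_devcoredump *cd, const char *fmt, ...)
{
	va_list args;

	if (cd->captured)
		return false;	/* keep the 'first' snapshot untouched */

	va_start(args, fmt);
	vsnprintf(cd->reason, sizeof(cd->reason), fmt, args);
	va_end(args);

	cd->captured = true;
	return true;
}

/* Models release by userspace or by the internal timer. */
void toy_devcoredump_release(struct toy_devcoredump *cd)
{
	cd->captured = false;
	cd->reason[0] = '\0';
}
```

After a release, the next failure is captured again, so a second call with a different reason only takes effect once the earlier snapshot has gone away.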
- void xe_print_blob_ascii85(struct drm_printer *p, const char *prefix, char suffix, const void *blob, size_t offset, size_t size)
Print a BLOB to some useful location in ASCII85
Parameters
struct drm_printer *p: the printer object to output to
const char *prefix: optional prefix to add to output string
char suffix: optional suffix to add at the end. 0 disables it and is not added to the output, which is useful when using multiple calls to dump data to p
const void *blob: the Binary Large OBject to dump out
size_t offset: offset in bytes to skip from the front of the BLOB, must be a multiple of sizeof(u32)
size_t size: the size in bytes of the BLOB, must be a multiple of sizeof(u32)
Description
The output is split into multiple calls to drm_puts() because some print targets, e.g. dmesg, cannot handle arbitrarily long lines. These targets may add newlines, as is the case with dmesg: each drm_puts() call creates a separate line.
There is also a scheduler yield call to prevent the ‘task has been stuck for 120s’ kernel hang check feature from firing when printing to a slow target such as dmesg over a serial port.
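The splitting of ASCII85 output across several print calls can be sketched in userspace as follows. Everything here is an illustrative assumption rather than the kernel implementation: the `toy_` names are hypothetical, `toy_puts()` stands in for drm_puts(), `TOY_LINE_MAX` is an arbitrary small per-call limit, and the `len` parameter is taken to mean the number of bytes to encode starting at `offset`. The word encoder follows the common ASCII85 scheme in which a zero word is written as "z" and any other 32-bit word as five characters starting at '!'.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_LINE_MAX 8	/* arbitrary per-call limit for this sketch */

/* Stand-in print target: collects text and counts print calls. */
struct toy_sink {
	char text[256];
	size_t used;
	unsigned int calls;
};

/* Stand-in for drm_puts(): append s to the sink, truncating if full. */
void toy_puts(struct toy_sink *sink, const char *s)
{
	size_t room = sizeof(sink->text) - 1 - sink->used;
	size_t n = strlen(s);

	if (n > room)
		n = room;
	memcpy(sink->text + sink->used, s, n);
	sink->used += n;
	sink->text[sink->used] = '\0';
	sink->calls++;
}

/* Encode one 32-bit word; out must have room for 6 bytes. */
const char *toy_ascii85_encode(uint32_t in, char *out)
{
	int i;

	if (in == 0)
		return "z";

	out[5] = '\0';
	for (i = 4; i >= 0; i--) {
		out[i] = (char)('!' + in % 85);
		in /= 85;
	}
	return out;
}

/* Encode len bytes of blob starting at offset, one call per line. */
void toy_print_blob_ascii85(struct toy_sink *sink, const char *prefix,
			    char suffix, const void *blob,
			    size_t offset, size_t len)
{
	const unsigned char *bytes = (const unsigned char *)blob + offset;
	char line[TOY_LINE_MAX + 2];	/* room for suffix and NUL */
	char word[6];
	size_t used = 0, i;

	if (prefix)
		toy_puts(sink, prefix);

	for (i = 0; i + sizeof(uint32_t) <= len; i += sizeof(uint32_t)) {
		const char *enc;
		size_t enc_len;
		uint32_t val;

		memcpy(&val, bytes + i, sizeof(val));
		enc = toy_ascii85_encode(val, word);
		enc_len = strlen(enc);

		/* Flush before the line would exceed the limit. A kernel
		 * implementation would also yield to the scheduler. */
		if (used + enc_len > TOY_LINE_MAX) {
			line[used] = '\0';
			toy_puts(sink, line);
			used = 0;
		}
		memcpy(line + used, enc, enc_len);
		used += enc_len;
	}

	if (suffix)
		line[used++] = suffix;
	if (used) {
		line[used] = '\0';
		toy_puts(sink, line);
	}
}
```

Passing 0 as the suffix leaves the output open-ended, so a caller can issue several calls with different offsets and terminate only the last one. That is the multi-call use case the suffix parameter description mentions.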