Xe Device Coredump

Xe uses dev_coredump infrastructure for exposing the crash errors in a standardized way. Once a crash occurs, devcoredump exposes a temporary node under /sys/class/devcoredump/devcd<m>/. The same node is also accessible in /sys/class/drm/card<n>/device/devcoredump/. The failing_device symlink points to the device that crashed and created the coredump.

The following characteristics are observed by xe when creating a device coredump:

Snapshot at hang:

The ‘data’ file contains a snapshot of the HW and driver states at the time the hang happened. Due to the driver recovering from resets/crashes, it may not correspond to the state of the system when the file is read by userspace.

Coredump release:

After a coredump is generated, it stays in kernel memory until released by userspace by writing anything to it, or after an internal timer expires. The exact timeout may vary and should not be relied upon. Example to release a coredump:

$ > /sys/class/drm/card0/device/devcoredump/data

First failure only:

In general, the first hang is the most critical one since the following hangs can be a consequence of the initial hang. For this reason a snapshot is taken only for the first failure. Until the devcoredump is released by userspace or kernel, all subsequent hangs do not override the snapshot nor create new ones. Devcoredump has a delayed work queue that will eventually delete the file node and free all the dump information.
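The userspace side of the behaviour described above (copy the snapshot out, then release it so that a later hang can be captured) could be sketched in C roughly as follows. This is an illustration only: the helper names save_coredump() and release_coredump() are hypothetical, and the paths passed to them are whatever 'data' file the caller has located.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy the whole coredump 'data' file to dst_path.
 * Returns the number of bytes copied, or -1 on error.
 */
static long save_coredump(const char *data_path, const char *dst_path)
{
	char buf[4096];
	long total = 0;
	ssize_t n;
	int in, out;

	assert(data_path && dst_path);

	in = open(data_path, O_RDONLY);
	if (in < 0)
		return -1;

	out = open(dst_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (out < 0) {
		close(in);
		return -1;
	}

	while ((n = read(in, buf, sizeof(buf))) > 0) {
		/* For brevity, a short write is treated as an error. */
		if (write(out, buf, (size_t)n) != n) {
			n = -1;
			break;
		}
		total += n;
	}

	close(in);
	close(out);
	return n < 0 ? -1 : total;
}

/*
 * Hypothetical helper: release the coredump by writing a single byte
 * to its 'data' file. Returns 0 on success, -1 on error.
 */
static int release_coredump(const char *data_path)
{
	ssize_t n;
	int fd;

	assert(data_path);

	fd = open(data_path, O_WRONLY);
	if (fd < 0)
		return -1;

	n = write(fd, "1", 1);
	close(fd);
	return n == 1 ? 0 : -1;
}
```

A caller would typically save the dump first and release it only after the copy succeeded, since the snapshot is gone once it has been released.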

Internal API

ssize_t xe_devcoredump_read(char *buffer, loff_t offset, size_t count, void *data, size_t datalen)

Read data from the Xe device coredump snapshot

Parameters

char *buffer

Destination buffer to copy the coredump data into

loff_t offset

Offset in the coredump data to start reading from

size_t count

Number of bytes to read

void *data

Pointer to the xe_devcoredump structure

size_t datalen

Length of the data (unused)

Description

Reads a chunk of the coredump snapshot data into the provided buffer. If the devcoredump is smaller than 1.5 GB (XE_DEVCOREDUMP_CHUNK_MAX), it is read directly from a pre-written buffer. For larger devcoredumps, the pre-written buffer must be periodically repopulated from the snapshot state due to kmalloc size limitations.

Return

Number of bytes copied on success, or a negative error code on failure.
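The chunked read described above can be modelled in plain userspace C. The sketch below is not the driver's code: struct dump, fill_chunk() and dump_read() are hypothetical names, CHUNK_MAX is a tiny stand-in for XE_DEVCOREDUMP_CHUNK_MAX, and the "snapshot" is a synthetic byte pattern. It only shows the idea of serving reads from a bounded buffer that is repopulated whenever the requested offset falls outside the chunk currently held.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_MAX 8	/* tiny stand-in for XE_DEVCOREDUMP_CHUNK_MAX */

/* Hypothetical model of a dump that is larger than its staging buffer. */
struct dump {
	size_t size;		/* total size of the rendered dump */
	size_t chunk_start;	/* dump offset of the data held in chunk[] */
	int chunk_valid;	/* non-zero once chunk[] has been populated */
	unsigned int refills;	/* how often chunk[] was repopulated */
	char chunk[CHUNK_MAX];
};

/*
 * Stand-in for re-rendering part of the snapshot: byte i of the dump
 * is 'a' + i % 26.
 */
static void fill_chunk(struct dump *d, size_t start)
{
	size_t left = d->size - start;
	size_t n = left < CHUNK_MAX ? left : CHUNK_MAX;
	size_t i;

	for (i = 0; i < n; i++)
		d->chunk[i] = (char)('a' + (start + i) % 26);

	d->chunk_start = start;
	d->chunk_valid = 1;
	d->refills++;
}

/*
 * Copy up to count bytes starting at offset. A single call never
 * crosses a chunk boundary, so callers loop on short reads.
 * Returns the number of bytes copied, 0 at end of dump.
 */
static long dump_read(struct dump *d, char *buffer, size_t offset, size_t count)
{
	size_t start, in_chunk, n;

	assert(d && buffer);

	if (offset >= d->size)
		return 0;

	/* Repopulate the buffer if the offset is outside the held chunk. */
	start = offset - offset % CHUNK_MAX;
	if (!d->chunk_valid || d->chunk_start != start)
		fill_chunk(d, start);

	in_chunk = offset - start;
	n = CHUNK_MAX - in_chunk;		/* bytes left in this chunk */
	if (n > d->size - offset)
		n = d->size - offset;		/* bytes left in the dump */
	if (n > count)
		n = count;

	memcpy(buffer, d->chunk + in_chunk, n);
	return (long)n;
}
```

Sequential reads in this model trigger one refill per chunk, while a dump that fits in a single chunk is populated once and then served directly from the buffer.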

void xe_devcoredump(struct xe_exec_queue *q, struct xe_sched_job *job, const char *fmt, ...)

Take the required snapshots and initialize coredump device.

Parameters

struct xe_exec_queue *q

The faulty xe_exec_queue, where the issue was detected.

struct xe_sched_job *job

The faulty xe_sched_job, where the issue was detected.

const char *fmt

Printf format + args to describe the reason for the core dump

...

variable arguments

Description

This function should be called at the crash time within the serialized gt_reset. It is skipped if we still have the core dump device available with the information of the ‘first’ snapshot.
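The "skip while the first snapshot is still held" behaviour can be illustrated with a small userspace model. This is a sketch under assumptions, not the driver's implementation: struct coredump_state, capture_once() and coredump_release() are hypothetical names, and the real function additionally relies on being called from the serialized reset path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical model of the per-device coredump state. */
struct coredump_state {
	bool captured;		/* true while a snapshot is held */
	char reason[64];	/* reason recorded with the held snapshot */
};

/*
 * Record a snapshot only if none is currently held.
 * Returns true if a new snapshot was taken, false if it was skipped.
 */
static bool capture_once(struct coredump_state *s, const char *reason)
{
	assert(s && reason);

	if (s->captured)
		return false;	/* keep the first snapshot untouched */

	snprintf(s->reason, sizeof(s->reason), "%s", reason);
	s->captured = true;
	return true;
}

/* Model of the release by userspace or by the internal timer. */
static void coredump_release(struct coredump_state *s)
{
	assert(s);
	memset(s, 0, sizeof(*s));
}
```

In this model a second capture_once() call leaves the stored reason unchanged until coredump_release() has run, which mirrors the first-failure-only rule.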

void xe_print_blob_ascii85(struct drm_printer *p, const char *prefix, char suffix, const void *blob, size_t offset, size_t size)

print a BLOB to some useful location in ASCII85

Parameters

struct drm_printer *p

the printer object to output to

const char *prefix

optional prefix to add to output string

char suffix

optional suffix to add at the end. 0 disables it and is not added to the output, which is useful when using multiple calls to dump data to p

const void *blob

the Binary Large OBject to dump out

size_t offset

offset in bytes to skip from the front of the BLOB, must be a multiple of sizeof(u32)

size_t size

the size in bytes of the BLOB, must be a multiple of sizeof(u32)

Description

The output is split into multiple calls to drm_puts() because some print targets, e.g. dmesg, cannot handle arbitrarily long lines. These targets may add newlines, as is the case with dmesg: each drm_puts() call creates a separate line.

There is also a scheduler yield call to prevent the ‘task has been stuck for 120s’ kernel hang check feature from firing when printing to a slow target such as dmesg over a serial port.
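For readers who want to see what an ASCII85 dump of 32-bit words looks like, here is a minimal userspace sketch. It is not the kernel helper: a85_word() and a85_blob() are hypothetical names, the words are read in host byte order, the usual "z" shorthand for an all-zero word is assumed, and the splitting across drm_puts() calls and the scheduler yield are left out. The offset and size checks mirror the multiple-of-sizeof(u32) requirement stated in the parameters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Encode one 32-bit word: zero becomes "z", anything else becomes five
 * base-85 digits offset by '!'. out must hold at least 6 bytes.
 */
static const char *a85_word(uint32_t in, char out[6])
{
	int i;

	if (in == 0)
		return "z";

	out[5] = '\0';
	for (i = 4; i >= 0; i--) {
		out[i] = (char)('!' + in % 85);
		in /= 85;
	}
	return out;
}

/*
 * Encode the words of blob from byte offset up to byte size into dst as
 * a NUL-terminated string. Stops early if dst runs out of room.
 * Returns the number of characters written, excluding the NUL.
 */
static size_t a85_blob(const void *blob, size_t offset, size_t size,
		       char *dst, size_t dst_len)
{
	const unsigned char *bytes = blob;
	size_t used = 0;
	char tmp[6];

	assert(offset % sizeof(uint32_t) == 0);
	assert(size % sizeof(uint32_t) == 0);
	assert(dst && dst_len > 0);

	dst[0] = '\0';
	for (; offset < size; offset += sizeof(uint32_t)) {
		const char *enc;
		uint32_t word;
		size_t len;

		/* memcpy avoids alignment assumptions; host byte order. */
		memcpy(&word, bytes + offset, sizeof(word));
		enc = a85_word(word, tmp);
		len = strlen(enc);

		if (used + len + 1 > dst_len)
			break;

		memcpy(dst + used, enc, len + 1);
		used += len;
	}
	return used;
}
```

A decoder for such a dump has to reverse the same per-word scheme, expanding each "z" back to four zero bytes and each five-character group back to one 32-bit word.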