Using the tracer for debugging

Copyright 2024 Google LLC.

Author:

Steven Rostedt <rostedt@goodmis.org>

License:

The GNU Free Documentation License, Version 1.2(dual licensed under the GPL v2)

  • Written for: 6.12

Introduction

The tracing infrastructure can be very useful for debugging the Linuxkernel. This document is a place to add various methods of using the tracerfor debugging.

First, make sure that the tracefs file system is mounted:

$ sudo mount -t tracefs tracefs /sys/kernel/tracing

Using trace_printk()

trace_printk() is a very lightweight utility that can be used in any contextinside the kernel, with the exception of “noinstr” sections. It can be usedin normal, softirq, interrupt and even NMI context. The trace data iswritten to the tracing ring buffer in a lockless way. To make it evenlighter weight, when possible, it will only record the pointer to the formatstring, and save the raw arguments into the buffer. The format and thearguments will be post processed when the ring buffer is read. This way thetrace_printk() format conversions are not done during the hot path, wherethe trace is being recorded.

trace_printk() is meant only for debugging, and should never be added intoa subsystem of the kernel. If you need debugging traces, add trace eventsinstead. If atrace_printk() is found in the kernel, the following willappear in the dmesg:

************************************************************   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   ****                                                      **** trace_printk() being used. Allocating extra memory.  ****                                                      **** This means that this is a DEBUG kernel and it is     **** unsafe for production use.                           ****                                                      **** If you see this message and you are not debugging    **** the kernel, report this immediately to your vendor!  ****                                                      ****   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   ************************************************************

Debugging kernel crashes

There is various methods of acquiring the state of the system when a kernelcrash occurs. This could be from the oops message in printk, or one coulduse kexec/kdump. But these just show what happened at the time of the crash.It can be very useful in knowing what happened up to the point of the crash.The tracing ring buffer, by default, is a circular buffer that willoverwrite older events with newer ones. When a crash happens, the content ofthe ring buffer will be all the events that lead up to the crash.

There are several kernel command line parameters that can be used to help inthis. The first is “ftrace_dump_on_oops”. This will dump the tracing ringbuffer when a oops occurs to the console. This can be useful if the consoleis being logged somewhere. If a serial console is used, it may be prudent tomake sure the ring buffer is relatively small, otherwise the dumping of thering buffer may take several minutes to hours to finish. Here’s an exampleof the kernel command line:

ftrace_dump_on_oops trace_buf_size=50K

Note, the tracing buffer is made up of per CPU buffers where each of thesebuffers is broken up into sub-buffers that are by default PAGE_SIZE. Theabove trace_buf_size option above sets each of the per CPU buffers to 50K,so, on a machine with 8 CPUs, that’s actually 400K total.

Persistent buffers across boots

If the system memory allows it, the tracing ring buffer can be specified ata specific location in memory. If the location is the same across boots andthe memory is not modified, the tracing buffer can be retrieved from thefollowing boot. There’s two ways to reserve memory for the use of the ringbuffer.

The more reliable way (on x86) is to reserve memory with the “memmap” kernelcommand line option and then use that memory for the trace_instance. Thisrequires a bit of knowledge of the physical memory layout of the system. Theadvantage of using this method, is that the memory for the ring buffer willalways be the same:

memmap==12M$0x284500000 trace_instance=boot_map@0x284500000:12M

The memmap above reserves 12 megabytes of memory at the physical memorylocation 0x284500000. Then the trace_instance option will create a traceinstance “boot_map” at that same location with the same amount of memoryreserved. As the ring buffer is broke up into per CPU buffers, the 12megabytes will be broken up evenly between those CPUs. If you have 8 CPUs,each per CPU ring buffer will be 1.5 megabytes in size. Note, that alsoincludes meta data, so the amount of memory actually used by the ring bufferwill be slightly smaller.

Another more generic but less robust way to allocate a ring buffer mappingat boot is with the “reserve_mem” option:

reserve_mem=12M:4096:trace trace_instance=boot_map@trace

The reserve_mem option above will find 12 megabytes that are available atboot up, and align it by 4096 bytes. It will label this memory as “trace”that can be used by later command line options.

The trace_instance option creates a “boot_map” instance and will use thememory reserved by reserve_mem that was labeled as “trace”. This method ismore generic but may not be as reliable. Due to KASLR, the memory reservedby reserve_mem may not be located at the same location. If this happens,then the ring buffer will not be from the previous boot and will be reset.

Sometimes, by using a larger alignment, it can keep KASLR from moving thingsaround in such a way that it will move the location of the reserve_mem. Byusing a larger alignment, you may find better that the buffer is moreconsistent to where it is placed:

reserve_mem=12M:0x2000000:trace trace_instance=boot_map@trace

On boot up, the memory reserved for the ring buffer is validated. It will gothrough a series of tests to make sure that the ring buffer contains validdata. If it is, it will then set it up to be available to read from theinstance. If it fails any of the tests, it will clear the entire ring bufferand initialize it as new.

The layout of this mapped memory may not be consistent from kernel tokernel, so only the same kernel is guaranteed to work if the mapping ispreserved. Switching to a different kernel version may find a differentlayout and mark the buffer as invalid.

NB: Both the mapped address and size must be page aligned for the architecture.

Using trace_printk() in the boot instance

By default, the content oftrace_printk() goes into the top level tracinginstance. But this instance is never preserved across boots. To have thetrace_printk() content, and some other internal tracing go to the preservedbuffer (like dump stacks), either set the instance to be thetrace_printk()destination from the kernel command line, or set it after boot up via thetrace_printk_dest option.

After boot up:

echo 1 > /sys/kernel/tracing/instances/boot_map/options/trace_printk_dest

From the kernel command line:

reserve_mem=12M:4096:trace trace_instance=boot_map^traceprintk^traceoff@trace

If setting it from the kernel command line, it is recommended to alsodisable tracing with the “traceoff” flag, and enable tracing after boot up.Otherwise the trace from the most recent boot will be mixed with the tracefrom the previous boot, and may make it confusing to read.