Discovering Linux kernel subsystems used by a workload

Authors:
maintained-by:

Shuah Khan <skhan@linuxfoundation.org>

Key Points

  • Understanding system resources necessary to build and run a workloadis important.

  • Linux tracing and strace can be used to discover the system resourcesin use by a workload. The completeness of the system usage informationdepends on the completeness of coverage of a workload.

  • Performance and security of the operating system can be analyzed withthe help of tools such as:perf,stress-ng,paxtest.

  • Once we discover and understand the workload needs, we can focus on themto avoid regressions and use it to evaluate safety considerations.

Methodology

strace is adiagnostic, instructional, and debugging tool and can be used to discoverthe system resources in use by a workload. Once we discover and understandthe workload needs, we can focus on them to avoid regressions and use itto evaluate safety considerations. We use strace tool to trace workloads.

This method of tracing using strace tells us the system calls invoked bythe workload and doesn’t include all the system calls that can be invokedby it. In addition, this tracing method tells us just the code paths withinthese system calls that are invoked. As an example, if a workload opens afile and reads from it successfully, then the success path is the one thatis traced. Any error paths in that system call will not be traced. If thereis a workload that provides full coverage of a workload then the methodoutlined here will trace and find all possible code paths. The completenessof the system usage information depends on the completeness of coverage of aworkload.

The goal is tracing a workload on a system running a default kernel withoutrequiring custom kernel installs.

How do we gather fine-grained system information?

strace tool can be used to trace system calls made by a process and signalsit receives. System calls are the fundamental interface between anapplication and the operating system kernel. They enable a program torequest services from the kernel. For instance, the open() system call inLinux is used to provide access to a file in the file system. strace enablesus to track all the system calls made by an application. It lists all thesystem calls made by a process and their resulting output.

You can generate profiling data combining strace and perf record tools torecord the events and information associated with a process. This providesinsight into the process. “perf annotate” tool generates the statistics ofeach instruction of the program. This document goes over the details of howto gather fine-grained information on a workload’s usage of system resources.

We used strace to trace the perf, stress-ng, paxtest workloads to illustrateour methodology to discover resources used by a workload. This process canbe applied to trace other workloads.

Getting the system ready for tracing

Before we can get started we will show you how to get your system ready.We assume that you have a Linux distribution running on a physical systemor a virtual machine. Most distributions will include strace command. Let’sinstall other tools that aren’t usually included to build Linux kernel.Please note that the following works on Debian based distributions. Youmight have to find equivalent packages on other Linux distributions.

Install tools to build Linux kernel and tools in kernel repository.scripts/ver_linux is a good way to check if your system already hasthe necessary tools:

sudo apt-get install build-essential flex bison yaccsudo apt install libelf-dev systemtap-sdt-dev libslang2-dev libperl-dev libdw-dev

cscope is a good tool to browse kernel sources. Let’s install it now:

sudo apt-get install cscope

Install stress-ng and paxtest:

apt-get install stress-ngapt-get install paxtest

Workload overview

As mentioned earlier, we used strace to trace perf bench, stress-ng andpaxtest workloads to show how to analyze a workload and identify Linuxsubsystems used by these workloads. Let’s start with an overview of thesethree workloads to get a better understanding of what they do and how touse them.

perf bench (all) workload

The perf bench command contains multiple multi-threaded microkernelbenchmarks for executing different subsystems in the Linux kernel andsystem calls. This allows us to easily measure the impact of changes,which can help mitigate performance regressions. It also acts as a commonbenchmarking framework, enabling developers to easily create test cases,integrate transparently, and use performance-rich tooling subsystems.

Stress-ng netdev stressor workload

stress-ng is used for performing stress testing on the kernel. It allowsyou to exercise various physical subsystems of the computer, as well asinterfaces of the OS kernel, using “stressor-s”. They are available forCPU, CPU cache, devices, I/O, interrupts, file system, memory, network,operating system, pipelines, schedulers, and virtual machines. Please referto thestress-ng man-page tofind the description of all the available stressor-s. The netdev stressorstarts specified number (N) of workers that exercise various netdeviceioctl commands across all the available network devices.

paxtest kiddie workload

paxtest is a program that tests buffer overflows in the kernel. It testskernel enforcements over memory usage. Generally, execution in some memorysegments makes buffer overflows possible. It runs a set of programs thatattempt to subvert memory usage. It is used as a regression test suite forPaX, but might be useful to test other memory protection patches for thekernel. We used paxtest kiddie mode which looks for simple vulnerabilities.

What is strace and how do we use it?

As mentioned earlier, strace which is a useful diagnostic, instructional,and debugging tool and can be used to discover the system resources in useby a workload. It can be used:

  • To see how a process interacts with the kernel.

  • To see why a process is failing or hanging.

  • For reverse engineering a process.

  • To find the files on which a program depends.

  • For analyzing the performance of an application.

  • For troubleshooting various problems related to the operating system.

In addition, strace can generate run-time statistics on times, calls, anderrors for each system call and report a summary when program exits,suppressing the regular output. This attempts to show system time (CPU timespent running in the kernel) independent of wall clock time. We plan to usethese features to get information on workload system usage.

strace command supports basic, verbose, and stats modes. strace command whenrun in verbose mode gives more detailed information about the system callsinvoked by a process.

Running strace -c generates a report of the percentage of time spent in eachsystem call, the total time in seconds, the microseconds per call, the totalnumber of calls, the count of each system call that has failed with an errorand the type of system call made.

  • Usage: strace <command we want to trace>

  • Verbose mode usage: strace -v <command>

  • Gather statistics: strace -c <command>

We used the “-c” option to gather fine-grained run-time statistics in useby three workloads we have chose for this analysis.

  • perf

  • stress-ng

  • paxtest

What is cscope and how do we use it?

Now let’s look atcscope, a commandline tool for browsing C, C++ or Java code-bases. We can use it to findall the references to a symbol, global definitions, functions called by afunction, functions calling a function, text strings, regular expressionpatterns, files including a file.

We can use cscope to find which system call belongs to which subsystem.This way we can find the kernel subsystems used by a process when it isexecuted.

Let’s checkout the latest Linux repository and build cscope database:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git linuxcd linuxcscope -R -p10  # builds cscope.out database before starting browse sessioncscope -d -p10  # starts browse session on cscope.out database

Note: Run “cscope -R -p10” to build the database and c”scope -d -p10” toenter into the browsing session. cscope by default cscope.out database.To get out of this mode press ctrl+d. -p option is used to specify thenumber of file path components to display. -p10 is optimal for browsingkernel sources.

What is perf and how do we use it?

Perf is an analysis tool based on Linux 2.6+ systems, which abstracts theCPU hardware difference in performance measurement in Linux, and providesa simple command line interface. Perf is based on the perf_events interfaceexported by the kernel. It is very useful for profiling the system andfinding performance bottlenecks in an application.

If you haven’t already checked out the Linux mainline repository, you can doso and then build kernel and perf tool:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git linuxcd linuxmake -j3 allcd tools/perfmake

Note: The perf command can be built without building the kernel in therepository and can be run on older kernels. However matching the kerneland perf revisions gives more accurate information on the subsystem usage.

We used “perf stat” and “perf bench” options. For a detailed information onthe perf tool, run “perf -h”.

perf stat

The perf stat command generates a report of various hardware and softwareevents. It does so with the help of hardware counter registers found inmodern CPUs that keep the count of these activities. “perf stat cal” showsstats for cal command.

Perf bench

The perf bench command contains multiple multi-threaded microkernelbenchmarks for executing different subsystems in the Linux kernel andsystem calls. This allows us to easily measure the impact of changes,which can help mitigate performance regressions. It also acts as a commonbenchmarking framework, enabling developers to easily create test cases,integrate transparently, and use performance-rich tooling.

“perf bench all” command runs the following benchmarks:

  • sched/messaging

  • sched/pipe

  • syscall/basic

  • mem/memcpy

  • mem/memset

What is stress-ng and how do we use it?

As mentioned earlier, stress-ng is used for performing stress testing onthe kernel. It allows you to exercise various physical subsystems of thecomputer, as well as interfaces of the OS kernel, using stressor-s. Theyare available for CPU, CPU cache, devices, I/O, interrupts, file system,memory, network, operating system, pipelines, schedulers, and virtualmachines.

The netdev stressor starts N workers that exercise various netdevice ioctlcommands across all the available network devices. The following ioctls areexercised:

  • SIOCGIFCONF, SIOCGIFINDEX, SIOCGIFNAME, SIOCGIFFLAGS

  • SIOCGIFADDR, SIOCGIFNETMASK, SIOCGIFMETRIC, SIOCGIFMTU

  • SIOCGIFHWADDR, SIOCGIFMAP, SIOCGIFTXQLEN

The following command runs the stressor:

stress-ng --netdev 1 -t 60 --metrics command.

We can use the perf record command to record the events and informationassociated with a process. This command records the profiling data in theperf.data file in the same directory.

Using the following commands you can record the events associated with thenetdev stressor, view the generated report perf.data and annotate the toview the statistics of each instruction of the program:

perf record stress-ng --netdev 1 -t 60 --metrics command.perf reportperf annotate

What is paxtest and how do we use it?

paxtest is a program that tests buffer overflows in the kernel. It testskernel enforcements over memory usage. Generally, execution in some memorysegments makes buffer overflows possible. It runs a set of programs thatattempt to subvert memory usage. It is used as a regression test suite forPaX, and will be useful to test other memory protection patches for thekernel.

paxtest provides kiddie and blackhat modes. The paxtest kiddie mode runsin normal mode, whereas the blackhat mode tries to get around the protectionof the kernel testing for vulnerabilities. We focus on the kiddie mode hereand combine “paxtest kiddie” run with “perf record” to collect CPU stacktraces for the paxtest kiddie run to see which function is calling otherfunctions in the performance profile. Then the “dwarf” (DWARF’s Call FrameInformation) mode can be used to unwind the stack.

The following command can be used to view resulting report in call-graphformat:

perf record --call-graph dwarf paxtest kiddieperf report --stdio

Tracing workloads

Now that we understand the workloads, let’s start tracing them.

Tracing perf bench all workload

Run the following command to trace perf bench all workload:

strace -c perf bench all

System Calls made by the workload

The below table shows the system calls invoked by the workload, number oftimes each system call is invoked, and the corresponding Linux subsystem.

System Call

# calls

Linux Subsystem

System Call (API)

getppid

10000001

Process Mgmt

sys_getpid()

clone

1077

Process Mgmt.

sys_clone()

prctl

23

Process Mgmt.

sys_prctl()

prlimit64

7

Process Mgmt.

sys_prlimit64()

getpid

10

Process Mgmt.

sys_getpid()

uname

3

Process Mgmt.

sys_uname()

sysinfo

1

Process Mgmt.

sys_sysinfo()

getuid

1

Process Mgmt.

sys_getuid()

getgid

1

Process Mgmt.

sys_getgid()

geteuid

1

Process Mgmt.

sys_geteuid()

getegid

1

Process Mgmt.

sys_getegid

close

49951

Filesystem

sys_close()

pipe

604

Filesystem

sys_pipe()

openat

48560

Filesystem

sys_opennat()

fstat

8338

Filesystem

sys_fstat()

stat

1573

Filesystem

sys_stat()

pread64

9646

Filesystem

sys_pread64()

getdents64

1873

Filesystem

sys_getdents64()

access

3

Filesystem

sys_access()

lstat

1880

Filesystem

sys_lstat()

lseek

6

Filesystem

sys_lseek()

ioctl

3

Filesystem

sys_ioctl()

dup2

1

Filesystem

sys_dup2()

execve

2

Filesystem

sys_execve()

fcntl

8779

Filesystem

sys_fcntl()

statfs

1

Filesystem

sys_statfs()

epoll_create

2

Filesystem

sys_epoll_create()

epoll_ctl

64

Filesystem

sys_epoll_ctl()

newfstatat

8318

Filesystem

sys_newfstatat()

eventfd2

192

Filesystem

sys_eventfd2()

mmap

243

Memory Mgmt.

sys_mmap()

mprotect

32

Memory Mgmt.

sys_mprotect()

brk

21

Memory Mgmt.

sys_brk()

munmap

128

Memory Mgmt.

sys_munmap()

set_mempolicy

156

Memory Mgmt.

sys_set_mempolicy()

set_tid_address

1

Process Mgmt.

sys_set_tid_address()

set_robust_list

1

Futex

sys_set_robust_list()

futex

341

Futex

sys_futex()

sched_getaffinity

79

Scheduler

sys_sched_getaffinity()

sched_setaffinity

223

Scheduler

sys_sched_setaffinity()

socketpair

202

Network

sys_socketpair()

rt_sigprocmask

21

Signal

sys_rt_sigprocmask()

rt_sigaction

36

Signal

sys_rt_sigaction()

rt_sigreturn

2

Signal

sys_rt_sigreturn()

wait4

889

Time

sys_wait4()

clock_nanosleep

37

Time

sys_clock_nanosleep()

capget

4

Capability

sys_capget()

Tracing stress-ng netdev stressor workload

Run the following command to trace stress-ng netdev stressor workload:

strace -c  stress-ng --netdev 1 -t 60 --metrics

System Calls made by the workload

The below table shows the system calls invoked by the workload, number oftimes each system call is invoked, and the corresponding Linux subsystem.

System Call

# calls

Linux Subsystem

System Call (API)

openat

74

Filesystem

sys_openat()

close

75

Filesystem

sys_close()

read

58

Filesystem

sys_read()

fstat

20

Filesystem

sys_fstat()

flock

10

Filesystem

sys_flock()

write

7

Filesystem

sys_write()

getdents64

8

Filesystem

sys_getdents64()

pread64

8

Filesystem

sys_pread64()

lseek

1

Filesystem

sys_lseek()

access

2

Filesystem

sys_access()

getcwd

1

Filesystem

sys_getcwd()

execve

1

Filesystem

sys_execve()

mmap

61

Memory Mgmt.

sys_mmap()

munmap

3

Memory Mgmt.

sys_munmap()

mprotect

20

Memory Mgmt.

sys_mprotect()

mlock

2

Memory Mgmt.

sys_mlock()

brk

3

Memory Mgmt.

sys_brk()

rt_sigaction

21

Signal

sys_rt_sigaction()

rt_sigprocmask

1

Signal

sys_rt_sigprocmask()

sigaltstack

1

Signal

sys_sigaltstack()

rt_sigreturn

1

Signal

sys_rt_sigreturn()

getpid

8

Process Mgmt.

sys_getpid()

prlimit64

5

Process Mgmt.

sys_prlimit64()

arch_prctl

2

Process Mgmt.

sys_arch_prctl()

sysinfo

2

Process Mgmt.

sys_sysinfo()

getuid

2

Process Mgmt.

sys_getuid()

uname

1

Process Mgmt.

sys_uname()

setpgid

1

Process Mgmt.

sys_setpgid()

getrusage

1

Process Mgmt.

sys_getrusage()

geteuid

1

Process Mgmt.

sys_geteuid()

getppid

1

Process Mgmt.

sys_getppid()

sendto

3

Network

sys_sendto()

connect

1

Network

sys_connect()

socket

1

Network

sys_socket()

clone

1

Process Mgmt.

sys_clone()

set_tid_address

1

Process Mgmt.

sys_set_tid_address()

wait4

2

Time

sys_wait4()

alarm

1

Time

sys_alarm()

set_robust_list

1

Futex

sys_set_robust_list()

Tracing paxtest kiddie workload

Run the following command to trace paxtest kiddie workload:

strace -c paxtest kiddie

System Calls made by the workload

The below table shows the system calls invoked by the workload, number oftimes each system call is invoked, and the corresponding Linux subsystem.

System Call

# calls

Linux Subsystem

System Call (API)

read

3

Filesystem

sys_read()

write

11

Filesystem

sys_write()

close

41

Filesystem

sys_close()

stat

24

Filesystem

sys_stat()

fstat

2

Filesystem

sys_fstat()

pread64

6

Filesystem

sys_pread64()

access

1

Filesystem

sys_access()

pipe

1

Filesystem

sys_pipe()

dup2

24

Filesystem

sys_dup2()

execve

1

Filesystem

sys_execve()

fcntl

26

Filesystem

sys_fcntl()

openat

14

Filesystem

sys_openat()

rt_sigaction

7

Signal

sys_rt_sigaction()

rt_sigreturn

38

Signal

sys_rt_sigreturn()

clone

38

Process Mgmt.

sys_clone()

wait4

44

Time

sys_wait4()

mmap

7

Memory Mgmt.

sys_mmap()

mprotect

3

Memory Mgmt.

sys_mprotect()

munmap

1

Memory Mgmt.

sys_munmap()

brk

3

Memory Mgmt.

sys_brk()

getpid

1

Process Mgmt.

sys_getpid()

getuid

1

Process Mgmt.

sys_getuid()

getgid

1

Process Mgmt.

sys_getgid()

geteuid

2

Process Mgmt.

sys_geteuid()

getegid

1

Process Mgmt.

sys_getegid()

getppid

1

Process Mgmt.

sys_getppid()

arch_prctl

2

Process Mgmt.

sys_arch_prctl()

Conclusion

This document is intended to be used as a guide on how to gather fine-grainedinformation on the resources in use by workloads using strace.

References