Workqueue

Date:

September, 2010

Author:

Tejun Heo <tj@kernel.org>

Author:

Florian Mickler <florian@mickler.org>

Introduction

There are many cases where an asynchronous process execution context is needed and the workqueue (wq) API is the most commonly used mechanism for such cases.

When such an asynchronous execution context is needed, a work item describing which function to execute is put on a queue. An independent thread serves as the asynchronous execution context. The queue is called workqueue and the thread is called worker.

While there are work items on the workqueue the worker executes the functions associated with the work items one after the other. When there is no work item left on the workqueue the worker becomes idle. When a new work item gets queued, the worker begins executing again.

Why Concurrency Managed Workqueue?

In the original wq implementation, a multi threaded (MT) wq had one worker thread per CPU and a single threaded (ST) wq had one worker thread system-wide. A single MT wq needed to keep around the same number of workers as the number of CPUs. The kernel grew a lot of MT wq users over the years and with the number of CPU cores continuously rising, some systems saturated the default 32k PID space just booting up.

Although MT wq wasted a lot of resource, the level of concurrency provided was unsatisfactory. The limitation was common to both ST and MT wq albeit less severe on MT. Each wq maintained its own separate worker pool. An MT wq could provide only one execution context per CPU while an ST wq one for the whole system. Work items had to compete for those very limited execution contexts leading to various problems including proneness to deadlocks around the single execution context.

The tension between the provided level of concurrency and resource usage also forced its users to make unnecessary tradeoffs like libata choosing to use ST wq for polling PIOs and accepting an unnecessary limitation that no two polling PIOs can progress at the same time. As MT wq don’t provide much better concurrency, users which require higher level of concurrency, like async or fscache, had to implement their own thread pool.

Concurrency Managed Workqueue (cmwq) is a reimplementation of wq with focus on the following goals.

  • Maintain compatibility with the original workqueue API.

  • Use per-CPU unified worker pools shared by all wq to provide flexible level of concurrency on demand without wasting a lot of resource.

  • Automatically regulate worker pool and level of concurrency so that the API users don’t need to worry about such details.

The Design

In order to ease the asynchronous execution of functions a new abstraction, the work item, is introduced.

A work item is a simple struct that holds a pointer to the function that is to be executed asynchronously. Whenever a driver or subsystem wants a function to be executed asynchronously it has to set up a work item pointing to that function and queue that work item on a workqueue.
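As a minimal sketch (the names my_work and my_work_fn are illustrative, not part of the API), a driver typically does something like:

#include <linux/printk.h>
#include <linux/workqueue.h>

/* the function that will run asynchronously in a worker thread */
static void my_work_fn(struct work_struct *work)
{
        pr_info("hello from a work item\n");
}

/* work item pointing at my_work_fn() */
static DECLARE_WORK(my_work, my_work_fn);

static void kick_example(void)
{
        /* queue it on the system workqueue; a kworker executes it later */
        schedule_work(&my_work);
}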

A work item can be executed in either a thread or the BH (softirq) context.

For threaded workqueues, special purpose threads, called [k]workers, execute the functions off of the queue, one after the other. If no work is queued, the worker threads become idle. These worker threads are managed in worker-pools.

The cmwq design differentiates between the user-facing workqueues that subsystems and drivers queue work items on and the backend mechanism which manages worker-pools and processes the queued work items.

There are two worker-pools, one for normal work items and the other for high priority ones, for each possible CPU and some extra worker-pools to serve work items queued on unbound workqueues - the number of these backing pools is dynamic.

BH workqueues use the same framework. However, as there can only be one concurrent execution context, there’s no need to worry about concurrency. Each per-CPU BH worker pool contains only one pseudo worker which represents the BH execution context. A BH workqueue can be considered a convenience interface to softirq.

Subsystems and drivers can create and queue work items through special workqueue API functions as they see fit. They can influence some aspects of the way the work items are executed by setting flags on the workqueue they are putting the work item on. These flags include things like CPU locality, concurrency limits, priority and more. To get a detailed overview refer to the API description of alloc_workqueue() below. A sketch of such an allocation follows.
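For example, a hedged sketch of allocating a dedicated workqueue with a couple of such flags (the name "example_wq" and the particular flag choice are only illustrative):

static struct workqueue_struct *example_wq;

static int example_init(void)
{
        /* unbound, freezable, default max_active (0) */
        example_wq = alloc_workqueue("example_wq", WQ_UNBOUND | WQ_FREEZABLE, 0);
        if (!example_wq)
                return -ENOMEM;
        return 0;
}

static void example_exit(void)
{
        /* flushes remaining work items and frees the workqueue */
        destroy_workqueue(example_wq);
}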

When a work item is queued to a workqueue, the target worker-pool is determined according to the queue parameters and workqueue attributes and appended on the shared worklist of the worker-pool. For example, unless specifically overridden, a work item of a bound workqueue will be queued on the worklist of either normal or highpri worker-pool that is associated to the CPU the issuer is running on.

For any thread pool implementation, managing the concurrency level (how many execution contexts are active) is an important issue. cmwq tries to keep the concurrency at a minimal but sufficient level. Minimal to save resources and sufficient in that the system is used at its full capacity.

Each worker-pool bound to an actual CPU implements concurrency management by hooking into the scheduler. The worker-pool is notified whenever an active worker wakes up or sleeps and keeps track of the number of the currently runnable workers. Generally, work items are not expected to hog a CPU and consume many cycles. That means maintaining just enough concurrency to prevent work processing from stalling should be optimal. As long as there are one or more runnable workers on the CPU, the worker-pool doesn’t start execution of a new work, but, when the last running worker goes to sleep, it immediately schedules a new worker so that the CPU doesn’t sit idle while there are pending work items. This allows using a minimal number of workers without losing execution bandwidth.

Keeping idle workers around doesn’t cost other than the memory space for kthreads, so cmwq holds onto idle ones for a while before killing them.

For unbound workqueues, the number of backing pools is dynamic. Unbound workqueue can be assigned custom attributes using apply_workqueue_attrs() and workqueue will automatically create backing worker pools matching the attributes. The responsibility of regulating concurrency level is on the users. There is also a flag to mark a bound wq to ignore the concurrency management. Please refer to the API section for details.

Forward progress guarantee relies on that workers can be created when more execution contexts are necessary, which in turn is guaranteed through the use of rescue workers. All work items which might be used on code paths that handle memory reclaim are required to be queued on wq’s that have a rescue-worker reserved for execution under memory pressure. Else it is possible that the worker-pool deadlocks waiting for execution contexts to free up.

Application Programming Interface (API)

alloc_workqueue() allocates a wq. The original create_*workqueue() functions are deprecated and scheduled for removal. alloc_workqueue() takes three arguments - @name, @flags and @max_active. @name is the name of the wq and also used as the name of the rescuer thread if there is one.

A wq no longer manages execution resources but serves as a domain for forward progress guarantee, flush and work item attributes. @flags and @max_active control how work items are assigned execution resources, scheduled and executed.

flags

WQ_BH

BH workqueues can be considered a convenience interface to softirq. BH workqueues are always per-CPU and all BH work items are executed in the queueing CPU’s softirq context in the queueing order.

All BH workqueues must have 0 max_active and WQ_HIGHPRI is the only allowed additional flag.

BH work items cannot sleep. All other features such as delayed queueing, flushing and canceling are supported.
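A minimal sketch of a BH workqueue for a hypothetical driver that wants softirq-context execution (the names are made up; note that max_active must be 0 and the work function must not sleep):

static struct workqueue_struct *example_bh_wq;

static void example_bh_fn(struct work_struct *work)
{
        /* runs in the queueing CPU's softirq context; must not sleep */
}

static DECLARE_WORK(example_bh_work, example_bh_fn);

static int example_bh_init(void)
{
        example_bh_wq = alloc_workqueue("example_bh", WQ_BH, 0);
        if (!example_bh_wq)
                return -ENOMEM;

        queue_work(example_bh_wq, &example_bh_work);
        return 0;
}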

WQ_PERCPU

Work items queued to a per-cpu wq are bound to a specific CPU. This flag is the right choice when cpu locality is important.

This flag is the complement of WQ_UNBOUND.

WQ_UNBOUND

Work items queued to an unbound wq are served by the special worker-pools which host workers which are not bound to any specific CPU. This makes the wq behave as a simple execution context provider without concurrency management. The unbound worker-pools try to start execution of work items as soon as possible. Unbound wq sacrifices locality but is useful for the following cases.

  • Wide fluctuation in the concurrency level requirement is expected and using bound wq may end up creating large number of mostly unused workers across different CPUs as the issuer hops through different CPUs.

  • Long running CPU intensive workloads which can be better managed by the system scheduler.

WQ_FREEZABLE

A freezable wq participates in the freeze phase of the system suspend operations. Work items on the wq are drained and no new work item starts execution until thawed.

WQ_MEM_RECLAIM

All wq which might be used in the memory reclaim paths MUST have this flag set. The wq is guaranteed to have at least one execution context regardless of memory pressure.
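For instance, a block or filesystem driver whose work items sit on the writeback path would allocate its queue roughly as below ("fooblk" is a made-up name, used only for illustration):

static struct workqueue_struct *fooblk_wq;

static int fooblk_init(void)
{
        /* a rescuer thread is reserved so reclaim-path work can always run */
        fooblk_wq = alloc_workqueue("fooblk", WQ_MEM_RECLAIM, 0);
        if (!fooblk_wq)
                return -ENOMEM;
        return 0;
}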

WQ_HIGHPRI

Work items of a highpri wq are queued to the highpri worker-pool of the target cpu. Highpri worker-pools are served by worker threads with elevated nice level.

Note that normal and highpri worker-pools don’t interact with each other. Each maintains its separate pool of workers and implements concurrency management among its workers.

WQ_CPU_INTENSIVE

Work items of a CPU intensive wq do not contribute to the concurrency level. In other words, runnable CPU intensive work items will not prevent other work items in the same worker-pool from starting execution. This is useful for bound work items which are expected to hog CPU cycles so that their execution is regulated by the system scheduler.

Although CPU intensive work items don’t contribute to the concurrency level, start of their executions is still regulated by the concurrency management and runnable non-CPU-intensive work items can delay execution of CPU intensive work items.

This flag is meaningless for unbound wq.

max_active

@max_active determines the maximum number of execution contexts per CPU which can be assigned to the work items of a wq. For example, with @max_active of 16, at most 16 work items of the wq can be executing at the same time per CPU. This is always a per-CPU attribute, even for unbound workqueues.

The maximum limit for @max_active is 2048 and the default value used when 0 is specified is 1024. These values are chosen sufficiently high such that they are not the limiting factor while providing protection in runaway cases.

The number of active work items of a wq is usually regulated by the users of the wq, more specifically, by how many work items the users may queue at the same time. Unless there is a specific need for throttling the number of active work items, specifying ‘0’ is recommended.

Some users depend on strict execution ordering where only one work item is in flight at any given time and the work items are processed in queueing order. While the combination of @max_active of 1 and WQ_UNBOUND used to achieve this behavior, this is no longer the case. Use alloc_ordered_workqueue() instead.
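A sketch of the recommended replacement (the workqueue name and the WQ_MEM_RECLAIM flag are illustrative only):

static struct workqueue_struct *ordered_wq;

static int ordered_init(void)
{
        /* executes at most one work item at a time, in queueing order */
        ordered_wq = alloc_ordered_workqueue("example_ordered", WQ_MEM_RECLAIM);
        if (!ordered_wq)
                return -ENOMEM;
        return 0;
}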

Example Execution Scenarios

The following example execution scenarios try to illustrate how cmwq behaves under different configurations.

Work items w0, w1, w2 are queued to a bound wq q0 on the same CPU. w0 burns CPU for 5ms then sleeps for 10ms then burns CPU for 5ms again before finishing. w1 and w2 burn CPU for 5ms then sleep for 10ms.

Ignoring all other tasks, works and processing overhead, and assuming simple FIFO scheduling, the following is one highly simplified version of possible sequences of events with the original wq.

TIME IN MSECS  EVENT
0              w0 starts and burns CPU
5              w0 sleeps
15             w0 wakes up and burns CPU
20             w0 finishes
20             w1 starts and burns CPU
25             w1 sleeps
35             w1 wakes up and finishes
35             w2 starts and burns CPU
40             w2 sleeps
50             w2 wakes up and finishes

And with cmwq with @max_active >= 3,

TIME IN MSECS  EVENT
0              w0 starts and burns CPU
5              w0 sleeps
5              w1 starts and burns CPU
10             w1 sleeps
10             w2 starts and burns CPU
15             w2 sleeps
15             w0 wakes up and burns CPU
20             w0 finishes
20             w1 wakes up and finishes
25             w2 wakes up and finishes

If @max_active == 2,

TIME IN MSECS  EVENT
0              w0 starts and burns CPU
5              w0 sleeps
5              w1 starts and burns CPU
10             w1 sleeps
15             w0 wakes up and burns CPU
20             w0 finishes
20             w1 wakes up and finishes
20             w2 starts and burns CPU
25             w2 sleeps
35             w2 wakes up and finishes

Now, let’s assume w1 and w2 are queued to a different wq q1 which has WQ_CPU_INTENSIVE set,

TIME IN MSECS  EVENT
0              w0 starts and burns CPU
5              w0 sleeps
5              w1 and w2 start and burn CPU
10             w1 sleeps
15             w2 sleeps
15             w0 wakes up and burns CPU
20             w0 finishes
20             w1 wakes up and finishes
25             w2 wakes up and finishes

Guidelines

  • Do not forget to use WQ_MEM_RECLAIM if a wq may process work items which are used during memory reclaim. Each wq with WQ_MEM_RECLAIM set has an execution context reserved for it. If there is dependency among multiple work items used during memory reclaim, they should be queued to separate wq each with WQ_MEM_RECLAIM.

  • Unless strict ordering is required, there is no need to use ST wq.

  • Unless there is a specific need, using 0 for @max_active is recommended. In most use cases, concurrency level usually stays well under the default limit.

  • A wq serves as a domain for forward progress guarantee (WQ_MEM_RECLAIM), flush and work item attributes. Work items which are not involved in memory reclaim and don’t need to be flushed as a part of a group of work items, and don’t require any special attribute, can use one of the system wq. There is no difference in execution characteristics between using a dedicated wq and a system wq.

    Note: If something may generate more than @max_active outstanding work items (do stress test your producers), it may saturate a system wq and potentially lead to deadlock. It should utilize its own dedicated workqueue rather than the system wq.

  • Unless work items are expected to consume a huge amount of CPU cycles, using a bound wq is usually beneficial due to the increased level of locality in wq operations and work item execution.

Affinity Scopes

An unbound workqueue groups CPUs according to its affinity scope to improve cache locality. For example, if a workqueue is using the default affinity scope of “cache”, it will group CPUs according to last level cache boundaries. A work item queued on the workqueue will be assigned to a worker on one of the CPUs which share the last level cache with the issuing CPU. Once started, the worker may or may not be allowed to move outside the scope depending on the affinity_strict setting of the scope.

Workqueue currently supports the following affinity scopes.

default

Use the scope in module parameter workqueue.default_affinity_scope which is always set to one of the scopes below.

cpu

CPUs are not grouped. A work item issued on one CPU is processed by a worker on the same CPU. This makes unbound workqueues behave as per-cpu workqueues without concurrency management.

smt

CPUs are grouped according to SMT boundaries. This usually means that the logical threads of each physical CPU core are grouped together.

cache

CPUs are grouped according to cache boundaries. Which specific cache boundary is used is determined by the arch code. L3 is used in a lot of cases. This is the default affinity scope.

numa

CPUs are grouped according to NUMA boundaries.

system

All CPUs are put in the same group. Workqueue makes no effort to process a work item on a CPU close to the issuing CPU.

The default affinity scope can be changed with the module parameter workqueue.default_affinity_scope and a specific workqueue’s affinity scope can be changed using apply_workqueue_attrs().

If WQ_SYSFS is set, the workqueue will have the following affinity scope related interface files under its /sys/devices/virtual/workqueue/WQ_NAME/ directory.

affinity_scope

Read to see the current affinity scope. Write to change.

When default is the current scope, reading this file will also show the current effective scope in parentheses, for example, default(cache).

affinity_strict

0 by default indicating that affinity scopes are not strict. When a work item starts execution, workqueue makes a best-effort attempt to ensure that the worker is inside its affinity scope, which is called repatriation. Once started, the scheduler is free to move the worker anywhere in the system as it sees fit. This enables benefiting from scope locality while still being able to utilize other CPUs if necessary and available.

If set to 1, all workers of the scope are guaranteed always to be in the scope. This may be useful when crossing affinity scopes has other implications, for example, in terms of power consumption or workload isolation. Strict NUMA scope can also be used to match the workqueue behavior of older kernels.

Affinity Scopes and Performance

It’d be ideal if an unbound workqueue’s behavior is optimal for the vast majority of use cases without further tuning. Unfortunately, in the current kernel, there exists a pronounced trade-off between locality and utilization necessitating explicit configurations when workqueues are heavily used.

Higher locality leads to higher efficiency where more work is performed for the same number of consumed CPU cycles. However, higher locality may also cause lower overall system utilization if the work items are not spread enough across the affinity scopes by the issuers. The following performance testing with dm-crypt clearly illustrates this trade-off.

The tests are run on a CPU with 12-cores/24-threads split across four L3 caches (AMD Ryzen 9 3900x). CPU clock boost is turned off for consistency. /dev/dm-0 is a dm-crypt device created on NVME SSD (Samsung 990 PRO) and opened with cryptsetup with default settings.

Scenario 1: Enough issuers and work spread across the machine

The command used:

$ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k --ioengine=libaio \
  --iodepth=64 --runtime=60 --numjobs=24 --time_based --group_reporting \
  --name=iops-test-job --verify=sha512

There are 24 issuers, each issuing 64 IOs concurrently. --verify=sha512 makes fio generate and read back the content each time which makes execution locality matter between the issuer and kcryptd. The following are the read bandwidths and CPU utilizations depending on different affinity scope settings on kcryptd measured over five runs. Bandwidths are in MiBps, and CPU util in percents.

Affinity        Bandwidth (MiBps)  CPU util (%)
system          1159.40 ±1.34      99.31 ±0.02
cache           1166.40 ±0.89      99.34 ±0.01
cache (strict)  1166.00 ±0.71      99.35 ±0.01

With enough issuers spread across the system, there is no downside to “cache”, strict or otherwise. All three configurations saturate the whole machine but the cache-affine ones outperform by 0.6% thanks to improved locality.

Scenario 2: Fewer issuers, enough work for saturation

The command used:

$ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
  --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=8 \
  --time_based --group_reporting --name=iops-test-job --verify=sha512

The only difference from the previous scenario is --numjobs=8. There are a third as many issuers but still enough total work to saturate the system.

Affinity        Bandwidth (MiBps)  CPU util (%)
system          1155.40 ±0.89      97.41 ±0.05
cache           1154.40 ±1.14      96.15 ±0.09
cache (strict)  1112.00 ±4.64      93.26 ±0.35

This is more than enough work to saturate the system. Both “system” and “cache” are nearly saturating the machine but not fully. “cache” is using less CPU but the better efficiency puts it at the same bandwidth as “system”.

Eight issuers moving around over four L3 cache scope still allow “cache (strict)” to mostly saturate the machine but the loss of work conservation is now starting to hurt with 3.7% bandwidth loss.

Scenario 3: Even fewer issuers, not enough work to saturate

The command used:

$ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
  --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=4 \
  --time_based --group_reporting --name=iops-test-job --verify=sha512

Again, the only difference is --numjobs=4. With the number of issuers reduced to four, there now isn’t enough work to saturate the whole system and the bandwidth becomes dependent on completion latencies.

Affinity        Bandwidth (MiBps)  CPU util (%)
system           993.60 ±1.82      75.49 ±0.06
cache            973.40 ±1.52      74.90 ±0.07
cache (strict)   828.20 ±4.49      66.84 ±0.29

Now, the tradeoff between locality and utilization is clearer. “cache” shows a 2% bandwidth loss compared to “system” and “cache (strict)” a whopping 20%.

Conclusion and Recommendations

In the above experiments, the efficiency advantage of the “cache” affinity scope over “system” is, while consistent and noticeable, small. However, the impact is dependent on the distances between the scopes and may be more pronounced in processors with more complex topologies.

While the loss of work-conservation in certain scenarios hurts, it is a lot better than “cache (strict)” and maximizing workqueue utilization is unlikely to be the common case anyway. As such, “cache” is the default affinity scope for unbound pools.

  • As there is no one option which is great for most cases, workqueue usages that may consume a significant amount of CPU are recommended to configure the workqueues using apply_workqueue_attrs() and/or enable WQ_SYSFS.

  • An unbound workqueue with strict “cpu” affinity scope behaves the same as WQ_CPU_INTENSIVE per-cpu workqueue. There is no real advantage to the latter and an unbound workqueue provides a lot more flexibility.

  • Affinity scopes are introduced in Linux v6.5. To emulate the previous behavior, use strict “numa” affinity scope.

  • The loss of work-conservation in non-strict affinity scopes is likely originating from the scheduler. There is no theoretical reason why the kernel wouldn’t be able to do the right thing and maintain work-conservation in most cases. As such, it is possible that future scheduler improvements may make most of these tunables unnecessary.

Examining Configuration

Use tools/workqueue/wq_dump.py to examine unbound CPU affinity configuration, worker pools and how workqueues map to the pools:

$ tools/workqueue/wq_dump.py
Affinity Scopes
===============
wq_unbound_cpumask=0000000f

CPU
  nr_pods  4
  pod_cpus [0]=00000001 [1]=00000002 [2]=00000004 [3]=00000008
  pod_node [0]=0 [1]=0 [2]=1 [3]=1
  cpu_pod  [0]=0 [1]=1 [2]=2 [3]=3

SMT
  nr_pods  4
  pod_cpus [0]=00000001 [1]=00000002 [2]=00000004 [3]=00000008
  pod_node [0]=0 [1]=0 [2]=1 [3]=1
  cpu_pod  [0]=0 [1]=1 [2]=2 [3]=3

CACHE (default)
  nr_pods  2
  pod_cpus [0]=00000003 [1]=0000000c
  pod_node [0]=0 [1]=1
  cpu_pod  [0]=0 [1]=0 [2]=1 [3]=1

NUMA
  nr_pods  2
  pod_cpus [0]=00000003 [1]=0000000c
  pod_node [0]=0 [1]=1
  cpu_pod  [0]=0 [1]=0 [2]=1 [3]=1

SYSTEM
  nr_pods  1
  pod_cpus [0]=0000000f
  pod_node [0]=-1
  cpu_pod  [0]=0 [1]=0 [2]=0 [3]=0

Worker Pools
============
pool[00] ref= 1 nice=  0 idle/workers=  4/  4 cpu=  0
pool[01] ref= 1 nice=-20 idle/workers=  2/  2 cpu=  0
pool[02] ref= 1 nice=  0 idle/workers=  4/  4 cpu=  1
pool[03] ref= 1 nice=-20 idle/workers=  2/  2 cpu=  1
pool[04] ref= 1 nice=  0 idle/workers=  4/  4 cpu=  2
pool[05] ref= 1 nice=-20 idle/workers=  2/  2 cpu=  2
pool[06] ref= 1 nice=  0 idle/workers=  3/  3 cpu=  3
pool[07] ref= 1 nice=-20 idle/workers=  2/  2 cpu=  3
pool[08] ref=42 nice=  0 idle/workers=  6/  6 cpus=0000000f
pool[09] ref=28 nice=  0 idle/workers=  3/  3 cpus=00000003
pool[10] ref=28 nice=  0 idle/workers= 17/ 17 cpus=0000000c
pool[11] ref= 1 nice=-20 idle/workers=  1/  1 cpus=0000000f
pool[12] ref= 2 nice=-20 idle/workers=  1/  1 cpus=00000003
pool[13] ref= 2 nice=-20 idle/workers=  1/  1 cpus=0000000c

Workqueue CPU -> pool
=====================
[    workqueue \ CPU              0  1  2  3 dfl]
events                   percpu   0  2  4  6
events_highpri           percpu   1  3  5  7
events_long              percpu   0  2  4  6
events_unbound           unbound  9  9 10 10  8
events_freezable         percpu   0  2  4  6
events_power_efficient   percpu   0  2  4  6
events_freezable_pwr_ef  percpu   0  2  4  6
rcu_gp                   percpu   0  2  4  6
rcu_par_gp               percpu   0  2  4  6
slub_flushwq             percpu   0  2  4  6
netns                    ordered  8  8  8  8  8
...

See the command’s help message for more info.

Monitoring

Use tools/workqueue/wq_monitor.py to monitor workqueue operations:

$ tools/workqueue/wq_monitor.py events
                            total  infl  CPUtime  CPUhog CMW/RPR  mayday rescued
events                      18545     0      6.1       0       5       -       -
events_highpri                  8     0      0.0       0       0       -       -
events_long                     3     0      0.0       0       0       -       -
events_unbound              38306     0      0.1       -       7       -       -
events_freezable                0     0      0.0       0       0       -       -
events_power_efficient      29598     0      0.2       0       0       -       -
events_freezable_pwr_ef        10     0      0.0       0       0       -       -
sock_diag_events                0     0      0.0       0       0       -       -

                            total  infl  CPUtime  CPUhog CMW/RPR  mayday rescued
events                      18548     0      6.1       0       5       -       -
events_highpri                  8     0      0.0       0       0       -       -
events_long                     3     0      0.0       0       0       -       -
events_unbound              38322     0      0.1       -       7       -       -
events_freezable                0     0      0.0       0       0       -       -
events_power_efficient      29603     0      0.2       0       0       -       -
events_freezable_pwr_ef        10     0      0.0       0       0       -       -
sock_diag_events                0     0      0.0       0       0       -       -

...

See the command’s help message for more info.

Debugging

Because the work functions are executed by generic worker threads there are a few tricks needed to shed some light on misbehaving workqueue users.

Worker threads show up in the process list as:

root      5671  0.0  0.0      0     0 ?        S    12:07   0:00 [kworker/0:1]
root      5672  0.0  0.0      0     0 ?        S    12:07   0:00 [kworker/1:2]
root      5673  0.0  0.0      0     0 ?        S    12:12   0:00 [kworker/0:0]
root      5674  0.0  0.0      0     0 ?        S    12:13   0:00 [kworker/1:0]

If kworkers are going crazy (using too much cpu), there are two typesof possible problems:

  1. Something being scheduled in rapid succession

  2. A single work item that consumes lots of cpu cycles

The first one can be tracked using tracing:

$ echo workqueue:workqueue_queue_work > /sys/kernel/tracing/set_event
$ cat /sys/kernel/tracing/trace_pipe > out.txt
(wait a few secs)
^C

If something is busy looping on work queueing, it would be dominating the output and the offender can be determined with the work item function.

For the second type of problems it should be possible to just check the stack trace of the offending worker thread.

$ cat /proc/THE_OFFENDING_KWORKER/stack

The work item’s function should be trivially visible in the stack trace.

Non-reentrance Conditions

Workqueue guarantees that a work item cannot be re-entrant if the following conditions hold after a work item gets queued:

  1. The work function hasn’t been changed.

  2. No one queues the work item to another workqueue.

  3. The work item hasn’t been reinitiated.

In other words, if the above conditions hold, the work item is guaranteed to be executed by at most one worker system-wide at any given time.

Note that requeuing the work item (to the same queue) from within its own work function doesn’t break these conditions, so it’s safe to do. Otherwise, caution is required when breaking the conditions inside a work function, as shown in the sketch below.
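A sketch of such a self-requeueing work item (poll_wq, poll_fn() and the keep_polling flag are hypothetical names):

static struct workqueue_struct *poll_wq;
static bool keep_polling;       /* hypothetical condition set elsewhere */

static void poll_fn(struct work_struct *work)
{
        /* ... one pass of the actual processing ... */

        /*
         * Requeueing the same work item to the same workqueue from
         * inside its own function keeps the conditions above intact,
         * so the item still runs on at most one worker at a time.
         */
        if (READ_ONCE(keep_polling))
                queue_work(poll_wq, work);
}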

Kernel Inline Documentations Reference

struct workqueue_attrs

A struct for workqueue attributes.

Definition:

struct workqueue_attrs {
    int nice;
    cpumask_var_t cpumask;
    cpumask_var_t __pod_cpumask;
    bool affn_strict;
    enum wq_affn_scope affn_scope;
    bool ordered;
};

Members

nice

nice level

cpumask

allowed CPUs

Work items in this workqueue are affine to these CPUs and not allowed to execute on other CPUs. A pool serving a workqueue must have the same cpumask.

__pod_cpumask

internal attribute used to create per-pod pools

Internal use only.

Per-pod unbound worker pools are used to improve locality. Always a subset of ->cpumask. A workqueue can be associated with multiple worker pools with disjoint __pod_cpumask’s. Whether the enforcement of a pool’s __pod_cpumask is strict depends on affn_strict.

affn_strict

affinity scope is strict

If clear, workqueue will make a best-effort attempt at starting the worker inside __pod_cpumask but the scheduler is free to migrate it outside.

If set, workers are only allowed to run inside __pod_cpumask.

affn_scope

unbound CPU affinity scope

CPU pods are used to improve execution locality of unbound work items. There are multiple pod types, one for each wq_affn_scope, and every CPU in the system belongs to one pod in every pod type. CPUs that belong to the same pod share the worker pool. For example, selecting WQ_AFFN_NUMA makes the workqueue use a separate worker pool for each NUMA node.

ordered

work items must be executed one by one in queueing order

Description

This can be used to change attributes of an unbound workqueue.

work_pending

work_pending(work)

Find out whether a work item is currently pending

Parameters

work

The work item in question

delayed_work_pending

delayed_work_pending(w)

Find out whether a delayable work item is currently pending

Parameters

w

The work item in question

struct workqueue_struct *alloc_workqueue(const char *fmt, unsigned int flags, int max_active, ...)

allocate a workqueue

Parameters

const char *fmt

printf format for the name of the workqueue

unsigned int flags

WQ_* flags

int max_active

max in-flight work items, 0 for default

...

args for fmt

Description

For a per-cpu workqueue, max_active limits the number of in-flight work items for each CPU. e.g. max_active of 1 indicates that each CPU can be executing at most one work item for the workqueue.

For unbound workqueues, max_active limits the number of in-flight work items for the whole system. e.g. max_active of 16 indicates that there can be at most 16 work items executing for the workqueue in the whole system.

As sharing the same active counter for an unbound workqueue across multiple NUMA nodes can be expensive, max_active is distributed to each NUMA node according to the proportion of the number of online CPUs and enforced independently.

Depending on online CPU distribution, a node may end up with per-node max_active which is significantly lower than max_active, which can lead to deadlocks if the per-node concurrency limit is lower than the maximum number of interdependent work items for the workqueue.

To guarantee forward progress regardless of online CPU distribution, the concurrency limit on every node is guaranteed to be equal to or greater than min_active which is set to min(max_active, WQ_DFL_MIN_ACTIVE). This means that the sum of per-node max_active’s may be larger than max_active.

For detailed information on WQ_* flags, please refer to Workqueue.

Return

Pointer to the allocated workqueue on success, NULL on failure.

struct workqueue_struct *alloc_workqueue_lockdep_map(const char *fmt, unsigned int flags, int max_active, struct lockdep_map *lockdep_map, ...)

allocate a workqueue with user-defined lockdep_map

Parameters

const char *fmt

printf format for the name of the workqueue

unsigned int flags

WQ_* flags

int max_active

max in-flight work items, 0 for default

struct lockdep_map *lockdep_map

user-defined lockdep_map

...

args for fmt

Description

Same as alloc_workqueue() but with a user-defined lockdep_map. Useful for workqueues created with the same purpose and to avoid leaking a lockdep_map on each workqueue creation.

Return

Pointer to the allocated workqueue on success, NULL on failure.

alloc_ordered_workqueue_lockdep_map

alloc_ordered_workqueue_lockdep_map(fmt, flags, lockdep_map, args...)

allocate an ordered workqueue with user-defined lockdep_map

Parameters

fmt

printf format for the name of the workqueue

flags

WQ_* flags (only WQ_FREEZABLE and WQ_MEM_RECLAIM are meaningful)

lockdep_map

user-defined lockdep_map

args...

args for fmt

Description

Same as alloc_ordered_workqueue() but with a user-defined lockdep_map. Useful for workqueues created with the same purpose and to avoid leaking a lockdep_map on each workqueue creation.

Return

Pointer to the allocated workqueue on success, NULL on failure.

alloc_ordered_workqueue

alloc_ordered_workqueue(fmt, flags, args...)

allocate an ordered workqueue

Parameters

fmt

printf format for the name of the workqueue

flags

WQ_* flags (only WQ_FREEZABLE and WQ_MEM_RECLAIM are meaningful)

args...

args for fmt

Description

Allocate an ordered workqueue. An ordered workqueue executes at most one work item at any given time in the queued order. They are implemented as unbound workqueues with max_active of one.

Return

Pointer to the allocated workqueue on success, NULL on failure.

bool queue_work(struct workqueue_struct *wq, struct work_struct *work)

queue work on a workqueue

Parameters

struct workqueue_struct *wq

workqueue to use

struct work_struct *work

work to queue

Description

Returns false if work was already on a queue, true otherwise.

We queue the work to the CPU on which it was submitted, but if the CPU dies it can be processed by another CPU.

Memory-ordering properties: If it returns true, guarantees that all stores preceding the call to queue_work() in the program order will be visible from the CPU which will execute work by the time such work executes, e.g.,

{ x is initially 0 }

CPU0                            CPU1

WRITE_ONCE(x, 1);               [ work is being executed ]
r0 = queue_work(wq, work);      r1 = READ_ONCE(x);

Forbids: r0 == true && r1 == 0

bool queue_delayed_work(struct workqueue_struct *wq, struct delayed_work *dwork, unsigned long delay)

queue work on a workqueue after delay

Parameters

struct workqueue_struct *wq

workqueue to use

struct delayed_work *dwork

delayable work to queue

unsigned long delay

number of jiffies to wait before queueing

Description

Equivalent to queue_delayed_work_on() but tries to use the local CPU.

bool mod_delayed_work(struct workqueue_struct *wq, struct delayed_work *dwork, unsigned long delay)

modify delay of or queue a delayed work

Parameters

struct workqueue_struct *wq

workqueue to use

struct delayed_work *dwork

work to queue

unsigned long delay

number of jiffies to wait before queueing

Description

mod_delayed_work_on() on local CPU.

bool schedule_work_on(int cpu, struct work_struct *work)

put work task on a specific cpu

Parameters

int cpu

cpu to put the work task on

struct work_struct *work

job to be done

job to be done

Description

This puts a job on a specific cpu

bool schedule_work(struct work_struct *work)

put work task in global workqueue

Parameters

struct work_struct *work

job to be done

Description

Returns false if work was already on the kernel-global workqueue and true otherwise.

This puts a job in the kernel-global workqueue if it was not already queued and leaves it in the same position on the kernel-global workqueue otherwise.

Shares the same memory-ordering properties of queue_work(), cf. the DocBook header of queue_work().

bool enable_and_queue_work(struct workqueue_struct *wq, struct work_struct *work)

Enable and queue a work item on a specific workqueue

Parameters

struct workqueue_struct *wq

The target workqueue

struct work_struct *work

The work item to be enabled and queued

The work item to be enabled and queued

Description

This function combines the operations of enable_work() and queue_work(), providing a convenient way to enable and queue a work item in a single call. It invokes enable_work() on work and then queues it if the disable depth reached 0. Returns true if the disable depth reached 0 and work is queued, and false otherwise.

Note that work is always queued when disable depth reaches zero. If the desired behavior is queueing only if certain events took place while work is disabled, the user should implement the necessary state tracking and perform explicit conditional queueing after enable_work().

bool schedule_delayed_work_on(int cpu, struct delayed_work *dwork, unsigned long delay)

queue work in global workqueue on CPU after delay

Parameters

int cpu

cpu to use

struct delayed_work *dwork

job to be done

unsigned long delay

number of jiffies to wait

Description

After waiting for a given time this puts a job in the kernel-global workqueue on the specified CPU.

bool schedule_delayed_work(struct delayed_work *dwork, unsigned long delay)

put work task in global workqueue after delay

Parameters

struct delayed_work *dwork

job to be done

unsigned long delay

number of jiffies to wait or 0 for immediate execution

Description

After waiting for a given time this puts a job in the kernel-global workqueue.

for_each_pool

for_each_pool(pool, pi)

iterate through all worker_pools in the system

Parameters

pool

iteration cursor

pi

integer used for iteration

Description

This must be called either with wq_pool_mutex held or RCU read locked. If the pool needs to be used beyond the locking in effect, the caller is responsible for guaranteeing that the pool stays online.

The if/else clause exists only for the lockdep assertion and can be ignored.

for_each_pool_worker

for_each_pool_worker(worker, pool)

iterate through all workers of a worker_pool

Parameters

worker

iteration cursor

pool

worker_pool to iterate workers of

Description

This must be called with wq_pool_attach_mutex.

The if/else clause exists only for the lockdep assertion and can be ignored.

for_each_pwq

for_each_pwq(pwq, wq)

iterate through all pool_workqueues of the specified workqueue

Parameters

pwq

iteration cursor

wq

the target workqueue

Description

This must be called either with wq->mutex held or RCU read locked. If the pwq needs to be used beyond the locking in effect, the caller is responsible for guaranteeing that the pwq stays online.

The if/else clause exists only for the lockdep assertion and can be ignored.

int worker_pool_assign_id(struct worker_pool *pool)

allocate ID and assign it to pool

Parameters

struct worker_pool *pool

the pool pointer of interest

Description

Returns 0 if ID in [0, WORK_OFFQ_POOL_NONE) is allocated and assigned successfully, -errno on failure.

struct cpumask *unbound_effective_cpumask(struct workqueue_struct *wq)

effective cpumask of an unbound workqueue

Parameters

struct workqueue_struct *wq

workqueue of interest

Description

wq->unbound_attrs->cpumask contains the cpumask requested by the user which is masked with wq_unbound_cpumask to determine the effective cpumask. The default pwq is always mapped to the pool with the current effective cpumask.

struct worker_pool *get_work_pool(struct work_struct *work)

return the worker_pool a given work was associated with

Parameters

struct work_struct *work

the work item of interest

Description

Pools are created and destroyed under wq_pool_mutex, and allows read access under RCU read lock. As such, this function should be called under wq_pool_mutex or inside of a rcu_read_lock() region.

All fields of the returned pool are accessible as long as the above mentioned locking is in effect. If the returned pool needs to be used beyond the critical section, the caller is responsible for ensuring the returned pool is and stays online.

Return

The worker_pool work was last associated with. NULL if none.

void worker_set_flags(struct worker *worker, unsigned int flags)

set worker flags and adjust nr_running accordingly

Parameters

struct worker *worker

self

unsigned int flags

flags to set

Description

Set flags in worker->flags and adjust nr_running accordingly.

void worker_clr_flags(struct worker *worker, unsigned int flags)

clear worker flags and adjust nr_running accordingly

Parameters

struct worker *worker

self

unsigned int flags

flags to clear

Description

Clear flags in worker->flags and adjust nr_running accordingly.

void worker_enter_idle(struct worker *worker)

enter idle state

Parameters

struct worker *worker

worker which is entering idle state

Description

worker is entering idle state. Update stats and idle timer if necessary.

LOCKING: raw_spin_lock_irq(pool->lock).

void worker_leave_idle(struct worker *worker)

leave idle state

Parameters

struct worker *worker

worker which is leaving idle state

Description

worker is leaving idle state. Update stats.

LOCKING: raw_spin_lock_irq(pool->lock).

struct worker *find_worker_executing_work(struct worker_pool *pool, struct work_struct *work)

find worker which is executing a work

Parameters

struct worker_pool *pool

pool of interest

struct work_struct *work

work to find worker for

Description

Find a worker which is executing work on pool by searching pool->busy_hash which is keyed by the address of work. For a worker to match, its current execution should match the address of work and its work function. This is to avoid unwanted dependency between unrelated work executions through a work item being recycled while still being executed.

This is a bit tricky. A work item may be freed once its execution starts and nothing prevents the freed area from being recycled for another work item. If the same work item address ends up being reused before the original execution finishes, workqueue will identify the recycled work item as currently executing and make it wait until the current execution finishes, introducing an unwanted dependency.

This function checks the work item address and work function to avoid false positives. Note that this isn’t complete as one may construct a work function which can introduce dependency onto itself through a recycled work item. Well, if somebody wants to shoot oneself in the foot that badly, there’s only so much we can do, and if such deadlock actually occurs, it should be easy to locate the culprit work function.

Context

raw_spin_lock_irq(pool->lock).

Return

Pointer to worker which is executing work if found, NULL otherwise.

void move_linked_works(struct work_struct *work, struct list_head *head, struct work_struct **nextp)

move linked works to a list

Parameters

struct work_struct *work

start of series of works to be scheduled

struct list_head *head

target list to append work to

struct work_struct **nextp

out parameter for nested worklist walking

Description

Schedule linked works starting from work to head. Work series to be scheduled starts at work and includes any consecutive work with WORK_STRUCT_LINKED set in its predecessor. See assign_work() for details on nextp.

Context

raw_spin_lock_irq(pool->lock).

bool assign_work(struct work_struct *work, struct worker *worker, struct work_struct **nextp)

assign a work item and its linked work items to a worker

Parameters

struct work_struct *work

work to assign

struct worker *worker

worker to assign to

struct work_struct **nextp

out parameter for nested worklist walking

Description

Assign work and its linked work items to worker. If work is already being executed by another worker in the same pool, it’ll be punted there.

If nextp is not NULL, it’s updated to point to the next work of the last scheduled work. This allows assign_work() to be nested inside list_for_each_entry_safe().

Returns true if work was successfully assigned to worker. false if work was punted to another worker already executing it.

bool kick_pool(struct worker_pool *pool)

wake up an idle worker if necessary

Parameters

struct worker_pool *pool

pool to kick

Description

pool may have pending work items. Wake up worker if necessary. Returns whether a worker was woken up.

void wq_worker_running(struct task_struct *task)

a worker is running again

Parameters

struct task_struct *task

task waking up

Description

This function is called when a worker returns from schedule()

void wq_worker_sleeping(struct task_struct *task)

a worker is going to sleep

Parameters

struct task_struct *task

task going to sleep

Description

This function is called from schedule() when a busy worker is going to sleep.

void wq_worker_tick(struct task_struct *task)

a scheduler tick occurred while a kworker is running

Parameters

struct task_struct *task

task currently running

Description

Called from sched_tick(). We’re in the IRQ context and the current worker’s fields which follow the ‘K’ locking rule can be accessed safely.

work_func_t wq_worker_last_func(struct task_struct *task)

retrieve worker’s last work function

Parameters

struct task_struct *task

Task to retrieve last work function of.

Description

Determine the last function a worker executed. This is called from the scheduler to get a worker’s last known identity.

This function is called during schedule() when a kworker is going to sleep. It’s used by psi to identify aggregation workers during dequeuing, to allow periodic aggregation to shut-off when that worker is the last task in the system or cgroup to go to sleep.

As this function doesn’t involve any workqueue-related locking, it only returns stable values when called from inside the scheduler’s queuing and dequeuing paths, when task, which must be a kworker, is guaranteed to not be processing any works.

Context

raw_spin_lock_irq(rq->lock)

Return

The last work function current executed as a worker, NULL if it hasn’t executed any work yet.

struct wq_node_nr_active *wq_node_nr_active(struct workqueue_struct *wq, int node)

Determine wq_node_nr_active to use

Parameters

struct workqueue_struct *wq

workqueue of interest

int node

NUMA node, can be NUMA_NO_NODE

Description

Determine wq_node_nr_active to use for wq on node. Returns:

  • NULL for per-cpu workqueues as they don’t need to use shared nr_active.

  • node_nr_active[nr_node_ids] if node is NUMA_NO_NODE.

  • Otherwise, node_nr_active[node].

void wq_update_node_max_active(struct workqueue_struct *wq, int off_cpu)

Update per-node max_actives to use

Parameters

struct workqueue_struct *wq

workqueue to update

int off_cpu

CPU that’s going down, -1 if a CPU is not going down

Description

Update wq->node_nr_active[]->max. wq must be unbound. max_active is distributed among nodes according to the proportions of numbers of online cpus. The result is always between wq->min_active and max_active.

void get_pwq(struct pool_workqueue *pwq)

get an extra reference on the specified pool_workqueue

Parameters

struct pool_workqueue *pwq

pool_workqueue to get

Description

Obtain an extra reference on pwq. The caller should guarantee that pwq has positive refcnt and be holding the matching pool->lock.

void put_pwq(struct pool_workqueue *pwq)

put a pool_workqueue reference

Parameters

struct pool_workqueue *pwq

pool_workqueue to put

Description

Drop a reference of pwq. If its refcnt reaches zero, schedule its destruction. The caller should be holding the matching pool->lock.

void put_pwq_unlocked(struct pool_workqueue *pwq)

put_pwq() with surrounding pool lock/unlock

Parameters

struct pool_workqueue *pwq

pool_workqueue to put (can be NULL)

Description

put_pwq() with locking. This function also allows NULL pwq.

bool pwq_tryinc_nr_active(struct pool_workqueue *pwq, bool fill)

Try to increment nr_active for a pwq

Parameters

struct pool_workqueue *pwq

pool_workqueue of interest

bool fill

max_active may have increased, try to increase concurrency level

Description

Try to increment nr_active for pwq. Returns true if an nr_active count is successfully obtained. false otherwise.

bool pwq_activate_first_inactive(struct pool_workqueue *pwq, bool fill)

Activate the first inactive work item on a pwq

Parameters

struct pool_workqueue *pwq

pool_workqueue of interest

bool fill

max_active may have increased, try to increase concurrency level

Description

Activate the first inactive work item of pwq if available and allowed by max_active limit.

Returns true if an inactive work item has been activated. false if no inactive work item is found or max_active limit is reached.

void unplug_oldest_pwq(struct workqueue_struct *wq)

unplug the oldest pool_workqueue

Parameters

struct workqueue_struct *wq

workqueue_struct where its oldest pwq is to be unplugged

Description

This function should only be called for ordered workqueues where only the oldest pwq is unplugged, the others are plugged to suspend execution to ensure proper work item ordering:

dfl_pwq --------------+     [P] - plugged
                      |
                      v
pwqs -> A -> B [P] -> C [P] (newest)
        |    |        |
        1    3        5
        |    |        |
        2    4        6

When the oldest pwq is drained and removed, this function should be called to unplug the next oldest one to start its work item execution. Note that pwq’s are linked into wq->pwqs with the oldest first, so the first one in the list is the oldest.

void node_activate_pending_pwq(struct wq_node_nr_active *nna, struct worker_pool *caller_pool)

Activate a pending pwq on a wq_node_nr_active

Parameters

struct wq_node_nr_active *nna

wq_node_nr_active to activate a pending pwq for

struct worker_pool *caller_pool

worker_pool the caller is locking

Description

Activate a pwq in nna->pending_pwqs. Called with caller_pool locked. caller_pool may be unlocked and relocked to lock other worker_pools.

void pwq_dec_nr_active(struct pool_workqueue *pwq)

Retire an active count

Parameters

struct pool_workqueue *pwq

pool_workqueue of interest

Description

Decrement pwq’s nr_active and try to activate the first inactive work item. For unbound workqueues, this function may temporarily drop pwq->pool->lock.

void pwq_dec_nr_in_flight(struct pool_workqueue *pwq, unsigned long work_data)

decrement pwq’s nr_in_flight

Parameters

struct pool_workqueue *pwq

pwq of interest

unsigned long work_data

work_data of work which left the queue

Description

A work either has completed or is removed from pending queue, decrement nr_in_flight of its pwq and handle workqueue flushing.

NOTE

For unbound workqueues, this function may temporarily drop pwq->pool->lock and thus should be called after all other state updates for the in-flight work item is complete.

Context

raw_spin_lock_irq(pool->lock).

int try_to_grab_pending(struct work_struct *work, u32 cflags, unsigned long *irq_flags)

steal work item from worklist and disable irq

Parameters

struct work_struct *work

work item to steal

u32 cflags

WORK_CANCEL_ flags

unsigned long *irq_flags

place to store irq state

Description

Try to grab PENDING bit of work. This function can handle work in any stable state - idle, on timer or on worklist.

1

if work was pending and we successfully stole PENDING

0

if work was idle and we claimed PENDING

-EAGAIN

if PENDING couldn’t be grabbed at the moment, safe to busy-retry

Note

On >= 0 return, the caller owns work’s PENDING bit. To avoid getting interrupted while holding PENDING and work off queue, irq must be disabled on entry. This, combined with delayed_work->timer being irqsafe, ensures that we return -EAGAIN for finite short period of time.

On successful return, >= 0, irq is disabled and the caller is responsible for releasing it using local_irq_restore(*irq_flags).

This function is safe to call from any context including IRQ handler.

bool work_grab_pending(struct work_struct *work, u32 cflags, unsigned long *irq_flags)

steal work item from worklist and disable irq

Parameters

struct work_struct *work

work item to steal

u32 cflags

WORK_CANCEL_ flags

unsigned long *irq_flags

place to store IRQ state

Description

Grab PENDING bit of work. work can be in any stable state - idle, on timer or on worklist.

Can be called from any context. IRQ is disabled on return with IRQ state stored in *irq_flags. The caller is responsible for re-enabling it using local_irq_restore().

Returns true if work was pending. false if idle.

void insert_work(struct pool_workqueue *pwq, struct work_struct *work, struct list_head *head, unsigned int extra_flags)

insert a work into a pool

Parameters

struct pool_workqueue *pwq

pwq work belongs to

struct work_struct *work

work to insert

struct list_head *head

insertion point

unsigned int extra_flags

extra WORK_STRUCT_* flags to set

Description

Insert work which belongs to pwq after head. extra_flags is or’d to work_struct flags.

Context

raw_spin_lock_irq(pool->lock).

bool queue_work_on(int cpu, struct workqueue_struct *wq, struct work_struct *work)

queue work on specific cpu

Parameters

int cpu

CPU number to execute work on

struct workqueue_struct *wq

workqueue to use

struct work_struct *work

work to queue

Description

We queue the work to a specific CPU, the caller must ensure it can’t go away. Callers that fail to ensure that the specified CPU cannot go away will execute on a randomly chosen CPU. But note well that callers specifying a CPU that never has been online will get a splat.

Return

false if work was already on a queue, true otherwise.

int select_numa_node_cpu(int node)

Select a CPU based on NUMA node

Parameters

int node

NUMA node ID that we want to select a CPU from

Description

This function will attempt to find a “random” cpu available on a given node. If there are no CPUs available on the given node it will return WORK_CPU_UNBOUND indicating that we should just schedule to any available CPU if we need to schedule this work.

bool queue_work_node(int node, struct workqueue_struct *wq, struct work_struct *work)

queue work on a “random” cpu for a given NUMA node

Parameters

int node

NUMA node that we are targeting the work for

struct workqueue_struct *wq

workqueue to use

struct work_struct *work

work to queue

Description

We queue the work to a “random” CPU within a given NUMA node. The basic idea here is to provide a way to somehow associate work with a given NUMA node.

This function will only make a best effort attempt at getting this onto the right NUMA node. If no node is requested or the requested node is offline then we just fall back to standard queue_work behavior.

Currently the “random” CPU ends up being the first available CPU in the intersection of cpu_online_mask and the cpumask of the node, unless we are running on the node. In that case we just use the current CPU.

Return

false if work was already on a queue, true otherwise.

bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq, struct delayed_work *dwork, unsigned long delay)

queue work on specific CPU after delay

Parameters

int cpu

CPU number to execute work on

struct workqueue_struct *wq

workqueue to use

struct delayed_work *dwork

work to queue

unsigned long delay

number of jiffies to wait before queueing

Description

We queue the delayed_work to a specific CPU, for non-zero delays the caller must ensure it is online and can’t go away. Callers that fail to ensure this, may get dwork->timer queued to an offlined CPU and this will prevent queueing of dwork->work unless the offlined CPU becomes online again.

Return

false if work was already on a queue, true otherwise. If delay is zero and dwork is idle, it will be scheduled for immediate execution.

bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq, struct delayed_work *dwork, unsigned long delay)

modify delay of or queue a delayed work on specific CPU

Parameters

int cpu

CPU number to execute work on

struct workqueue_struct *wq

workqueue to use

struct delayed_work *dwork

work to queue

unsigned long delay

number of jiffies to wait before queueing

Description

If dwork is idle, equivalent to queue_delayed_work_on(); otherwise, modify dwork’s timer so that it expires after delay. If delay is zero, work is guaranteed to be scheduled immediately regardless of its current state.

This function is safe to call from any context including IRQ handler. See try_to_grab_pending() for details.

Return

false if dwork was idle and queued, true if dwork was pending and its timer was modified.

bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork)

queue work after a RCU grace period

Parameters

struct workqueue_struct *wq

workqueue to use

struct rcu_work *rwork

work to queue

Return

false if rwork was already pending, true otherwise. Note that a full RCU grace period is guaranteed only after a true return. While rwork is guaranteed to be executed after a false return, the execution may happen before a full RCU grace period has passed.

void worker_attach_to_pool(struct worker *worker, struct worker_pool *pool)

attach a worker to a pool

Parameters

struct worker *worker

worker to be attached

struct worker_pool *pool

the target pool

Description

Attach worker to pool. Once attached, the WORKER_UNBOUND flag and cpu-binding of worker are kept coordinated with the pool across cpu-[un]hotplugs.

void worker_detach_from_pool(struct worker *worker)

detach a worker from its pool

Parameters

struct worker *worker

worker which is attached to its pool

Description

Undo the attaching which had been done in worker_attach_to_pool(). The caller worker shouldn’t access the pool after being detached unless it holds another reference to the pool.

struct worker *create_worker(struct worker_pool *pool)

create a new workqueue worker

Parameters

struct worker_pool *pool

pool the new worker will belong to

Description

Create and start a new worker which is attached to pool.

Context

Might sleep. Does GFP_KERNEL allocations.

Return

Pointer to the newly created worker.

void set_worker_dying(struct worker *worker, struct list_head *list)

Tag a worker for destruction

Parameters

struct worker *worker

worker to be destroyed

struct list_head *list

transfer worker away from its pool->idle_list and into list

Description

Tag worker for destruction and adjust pool stats accordingly. The worker should be idle.

Context

raw_spin_lock_irq(pool->lock).

void idle_worker_timeout(struct timer_list *t)

check if some idle workers can now be deleted.

Parameters

struct timer_list *t

The pool’s idle_timer that just expired

Description

The timer is armed in worker_enter_idle(). Note that it isn’t disarmed in worker_leave_idle(), as a worker flicking between idle and active while its pool is at the too_many_workers() tipping point would cause too much timer housekeeping overhead. Since IDLE_WORKER_TIMEOUT is long enough, we just let it expire and re-evaluate things from there.

void idle_cull_fn(struct work_struct *work)

cull workers that have been idle for too long.

Parameters

struct work_struct *work

the pool’s work for handling these idle workers

Description

This goes through a pool’s idle workers and gets rid of those that have been idle for at least IDLE_WORKER_TIMEOUT seconds.

We don’t want to disturb isolated CPUs because of a pcpu kworker being culled, so this also resets worker affinity. This requires a sleepable context, hence the split between timer callback and work item.

void maybe_create_worker(struct worker_pool *pool)

create a new worker if necessary

Parameters

struct worker_pool *pool

pool to create a new worker for

Description

Create a new worker for pool if necessary. pool is guaranteed to have at least one idle worker on return from this function. If creating a new worker takes longer than MAYDAY_INTERVAL, mayday is sent to all rescuers with works scheduled on pool to resolve possible allocation deadlock.

On return, need_to_create_worker() is guaranteed to be false and may_start_working() true.

LOCKING: raw_spin_lock_irq(pool->lock) which may be released and regrabbed multiple times. Does GFP_KERNEL allocations. Called only from manager.

bool manage_workers(struct worker *worker)

manage worker pool

Parameters

struct worker *worker

self

Description

Assume the manager role and manage the worker pool worker belongs to. At any given time, there can be only zero or one manager per pool. The exclusion is handled automatically by this function.

The caller can safely start processing works on false return. On true return, it’s guaranteed that need_to_create_worker() is false and may_start_working() is true.

Context

raw_spin_lock_irq(pool->lock) which may be released and regrabbed multiple times. Does GFP_KERNEL allocations.

Return

false if the pool doesn’t need management and the caller can safely start processing works, true if management function was performed and the conditions that the caller verified before calling the function may no longer be true.

void process_one_work(struct worker *worker, struct work_struct *work)

process single work

Parameters

struct worker *worker

self

struct work_struct *work

work to process

Description

Process work. This function contains all the logic necessary to process a single work item, including synchronization against and interaction with other workers on the same cpu, queueing and flushing. As long as the context requirement is met, any worker can call this function to process a work.

Context

raw_spin_lock_irq(pool->lock) which is released and regrabbed.

void process_scheduled_works(struct worker *worker)

process scheduled works

Parameters

struct worker *worker

self

Description

Process all scheduled works. Please note that the scheduled list may change while processing a work, so this function repeatedly fetches a work from the top and executes it.

Context

raw_spin_lock_irq(pool->lock) which may be released and regrabbed multiple times.

int worker_thread(void *__worker)

the worker thread function

Parameters

void *__worker

self

Description

The worker thread function. All workers belong to a worker_pool - either a per-cpu one or a dynamic unbound one. These workers process all work items regardless of their specific target workqueue. The only exception is work items which belong to workqueues with a rescuer, which will be explained in rescuer_thread().

Return

0

int rescuer_thread(void *__rescuer)

the rescuer thread function

Parameters

void *__rescuer

self

Description

Workqueue rescuer thread function. There’s one rescuer for each workqueue which has WQ_MEM_RECLAIM set.

Regular work processing on a pool may block trying to create a new worker, which uses GFP_KERNEL allocation and has a slight chance of developing into a deadlock if some works currently on the same queue need to be processed to satisfy the GFP_KERNEL allocation. This is the problem the rescuer solves.

When such a condition is possible, the pool summons rescuers of all workqueues which have works queued on the pool and lets them process those works so that forward progress can be guaranteed.

This should happen rarely.

Return

0

void check_flush_dependency(struct workqueue_struct *target_wq, struct work_struct *target_work, bool from_cancel)

check for flush dependency sanity

Parameters

struct workqueue_struct *target_wq

workqueue being flushed

struct work_struct *target_work

work item being flushed (NULL for workqueue flushes)

bool from_cancel

are we called from the work cancel path

Description

current is trying to flush the whole target_wq or target_work on it. If this is not the cancel path (which implies work being flushed is either already running, or will not be at all), check if target_wq doesn’t have WQ_MEM_RECLAIM and verify that current is not reclaiming memory or running on a workqueue which doesn’t have WQ_MEM_RECLAIM as that can break forward-progress guarantee leading to a deadlock.

void insert_wq_barrier(struct pool_workqueue *pwq, struct wq_barrier *barr, struct work_struct *target, struct worker *worker)

insert a barrier work

Parameters

struct pool_workqueue *pwq

pwq to insert barrier into

struct wq_barrier *barr

wq_barrier to insert

struct work_struct *target

target work to attach barr to

struct worker *worker

worker currently executing target, NULL if target is not executing

Description

barr is linked to target such that barr is completed only after target finishes execution. Please note that the ordering guarantee is observed only with respect to target and on the local cpu.

Currently, a queued barrier can’t be canceled. This is because try_to_grab_pending() can’t determine whether the work to be grabbed is at the head of the queue and thus can’t clear the LINKED flag of the previous work while there must be a valid next work after a work with the LINKED flag set.

Note that when worker is non-NULL, target may be modified underneath us, so we can’t reliably determine pwq from target.

Context

raw_spin_lock_irq(pool->lock).

bool flush_workqueue_prep_pwqs(struct workqueue_struct *wq, int flush_color, int work_color)

prepare pwqs for workqueue flushing

Parameters

struct workqueue_struct *wq

workqueue being flushed

int flush_color

new flush color, < 0 for no-op

int work_color

new work color, < 0 for no-op

Description

Prepare pwqs for workqueue flushing.

If flush_color is non-negative, flush_color on all pwqs should be -1. If no pwq has in-flight commands at the specified color, all pwq->flush_color’s stay at -1 and false is returned. If any pwq has in-flight commands, its pwq->flush_color is set to flush_color, wq->nr_pwqs_to_flush is updated accordingly, pwq wakeup logic is armed and true is returned.

The caller should have initialized wq->first_flusher prior to calling this function with non-negative flush_color. If flush_color is negative, no flush color update is done and false is returned.

If work_color is non-negative, all pwqs should have the same work_color which is previous to work_color and all will be advanced to work_color.

Context

mutex_lock(wq->mutex).

Return

true if flush_color >= 0 and there’s something to flush. false otherwise.

void __flush_workqueue(struct workqueue_struct *wq)

ensure that any scheduled work has run to completion.

Parameters

struct workqueue_struct *wq

workqueue to flush

Description

This function sleeps until all work items which were queued on entry have finished execution, but it is not livelocked by new incoming ones.

void drain_workqueue(struct workqueue_struct *wq)

drain a workqueue

Parameters

struct workqueue_struct *wq

workqueue to drain

Description

Wait until the workqueue becomes empty. While draining is in progress, only chain queueing is allowed. IOW, only currently pending or running work items on wq can queue further work items on it. wq is flushed repeatedly until it becomes empty. The number of flushes is determined by the depth of chaining and should be relatively short. Whine if it takes too long.

bool flush_work(struct work_struct *work)

wait for a work to finish executing the last queueing instance

Parameters

struct work_struct *work

the work to flush

Description

Wait until work has finished execution. work is guaranteed to be idle on return if it hasn’t been requeued since flush started.

Return

true if flush_work() waited for the work to finish execution, false if it was already idle.
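
A sketch of the usual pattern: queue an update and later wait for at least that queueing instance to finish before consuming the result. struct my_dev and its fields are assumptions:

    #include <linux/types.h>
    #include <linux/workqueue.h>

    struct my_dev {
            struct work_struct stats_work;
            u64 stats;                      /* filled in by stats_work */
    };

    static void read_stats(struct my_dev *dev)
    {
            queue_work(system_wq, &dev->stats_work);
            /* ... other setup ... */
            flush_work(&dev->stats_work);
            /* dev->stats now reflects at least that queueing instance */
    }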

bool flush_delayed_work(struct delayed_work *dwork)

wait for a dwork to finish executing the last queueing

Parameters

struct delayed_work *dwork

the delayed work to flush

Description

Delayed timer is cancelled and the pending work is queued for immediate execution. Like flush_work(), this function only considers the last queueing instance of dwork.

Return

true if flush_work() waited for the work to finish execution, false if it was already idle.

bool flush_rcu_work(struct rcu_work *rwork)

wait for a rwork to finish executing the last queueing

Parameters

struct rcu_work *rwork

the rcu work to flush

Return

true if flush_rcu_work() waited for the work to finish execution, false if it was already idle.

bool cancel_work_sync(struct work_struct *work)

cancel a work and wait for it to finish

Parameters

struct work_struct *work

the work to cancel

Description

Cancel work and wait for its execution to finish. This function can be used even if the work re-queues itself or migrates to another workqueue. On return from this function, work is guaranteed to be not pending or executing on any CPU as long as there aren’t racing enqueues.

cancel_work_sync(delayed_work->work) must not be used for delayed_work’s. Use cancel_delayed_work_sync() instead.

Must be called from a sleepable context if work was last queued on a non-BH workqueue. Can also be called from non-hardirq atomic contexts including BH if work was last queued on a BH workqueue.

Returns true if work was pending, false otherwise.
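
A typical teardown sketch: stop the source of new queueings first, then cancel and wait so nothing references the device afterwards. struct my_dev, its irq and io_work fields, and my_dev_teardown() are assumptions:

    #include <linux/interrupt.h>
    #include <linux/workqueue.h>

    struct my_dev {
            int irq;
            struct work_struct io_work;
    };

    static void my_dev_teardown(struct my_dev *dev)
    {
            disable_irq(dev->irq);            /* no new queueings from the IRQ handler */
            cancel_work_sync(&dev->io_work);  /* cancel pending, wait for in-flight */
    }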

bool cancel_delayed_work(struct delayed_work *dwork)

cancel a delayed work

Parameters

struct delayed_work *dwork

delayed_work to cancel

Description

Kill off a pending delayed_work.

Return

true if dwork was pending and canceled; false if it wasn’t pending.

Note

The work callback function may still be running on return, unless it returns true and the work doesn’t re-arm itself. Explicitly flush or use cancel_delayed_work_sync() to wait on it.

This function is safe to call from any context including IRQ handler.

bool cancel_delayed_work_sync(struct delayed_work *dwork)

cancel a delayed work and wait for it to finish

Parameters

struct delayed_work *dwork

the delayed work to cancel

Description

This is cancel_work_sync() for delayed works.

Return

true if dwork was pending, false otherwise.

bool disable_work(struct work_struct *work)

Disable and cancel a work item

Parameters

struct work_struct *work

work item to disable

Description

Disable work by incrementing its disable count and cancel it if currently pending. As long as the disable count is non-zero, any attempt to queue work will fail and return false. The maximum supported disable depth is 2 to the power of WORK_OFFQ_DISABLE_BITS, currently 65536.

Can be called from any context. Returns true if work was pending, false otherwise.

bool disable_work_sync(struct work_struct *work)

Disable, cancel and drain a work item

Parameters

struct work_struct *work

work item to disable

Description

Similar to disable_work() but also waits for work to finish if currently executing.

Must be called from a sleepable context if work was last queued on a non-BH workqueue. Can also be called from non-hardirq atomic contexts including BH if work was last queued on a BH workqueue.

Returns true if work was pending, false otherwise.

bool enable_work(struct work_struct *work)

Enable a work item

Parameters

struct work_struct *work

work item to enable

Description

Undo disable_work[_sync]() by decrementing work’s disable count. work can only be queued if its disable count is 0.

Can be called from any context. Returns true if the disable count reached 0. Otherwise, false.
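
A sketch of pairing the two: block queueing across a reconfiguration window, then allow it again. struct my_dev, refresh_work and reconfigure() are assumptions:

    #include <linux/workqueue.h>

    struct my_dev {
            struct work_struct refresh_work;
    };

    static void reconfigure(struct my_dev *dev);        /* hypothetical */

    static void my_dev_reconfig(struct my_dev *dev)
    {
            disable_work_sync(&dev->refresh_work);  /* cancel, drain, block queueing */
            reconfigure(dev);
            enable_work(&dev->refresh_work);        /* disable count back to 0 */
            queue_work(system_wq, &dev->refresh_work);
    }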

bool disable_delayed_work(struct delayed_work *dwork)

Disable and cancel a delayed work item

Parameters

struct delayed_work *dwork

delayed work item to disable

Description

disable_work() for delayed work items.

bool disable_delayed_work_sync(struct delayed_work *dwork)

Disable, cancel and drain a delayed work item

Parameters

struct delayed_work *dwork

delayed work item to disable

Description

disable_work_sync() for delayed work items.

bool enable_delayed_work(struct delayed_work *dwork)

Enable a delayed work item

Parameters

struct delayed_work *dwork

delayed work item to enable

Description

enable_work() for delayed work items.

int schedule_on_each_cpu(work_func_t func)

execute a function synchronously on each online CPU

Parameters

work_func_t func

the function to call

Description

schedule_on_each_cpu() executes func on each online CPU using the system workqueue and blocks until all CPUs have completed. schedule_on_each_cpu() is very slow.

Return

0 on success, -errno on failure.
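
A sketch that runs a callback on every online CPU and waits for all of them; the per-cpu counter and bump_all_cpus() are assumptions:

    #include <linux/percpu.h>
    #include <linux/workqueue.h>

    static DEFINE_PER_CPU(unsigned long, my_counter);

    static void bump_counter(struct work_struct *work)
    {
            this_cpu_inc(my_counter);       /* runs once on each online CPU */
    }

    static int bump_all_cpus(void)
    {
            return schedule_on_each_cpu(bump_counter);      /* blocks; may be slow */
    }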

int execute_in_process_context(work_func_t fn, struct execute_work *ew)

reliably execute the routine with user context

Parameters

work_func_t fn

the function to execute

struct execute_work *ew

guaranteed storage for the execute work structure (must be available when the work executes)

Description

Executes the function immediately if process context is available, otherwise schedules the function for delayed execution.

Return

0 - function was executed

1 - function was scheduled for execution
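
A sketch of a caller that may or may not be in process context; my_ew, do_cleanup() and request_cleanup() are assumptions, and my_ew must stay valid until the callback has run:

    #include <linux/workqueue.h>

    static struct execute_work my_ew;

    static void do_cleanup(struct work_struct *work)
    {
            /* needs process context, e.g. to sleep */
    }

    static void request_cleanup(void)
    {
            /* runs do_cleanup() inline if we already have process context,
             * otherwise defers it to a system workqueue */
            execute_in_process_context(do_cleanup, &my_ew);
    }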

void free_workqueue_attrs(struct workqueue_attrs *attrs)

free a workqueue_attrs

Parameters

struct workqueue_attrs *attrs

workqueue_attrs to free

Description

Undo alloc_workqueue_attrs().

struct workqueue_attrs *alloc_workqueue_attrs(void)

allocate a workqueue_attrs

Parameters

void

no arguments

Description

Allocate a new workqueue_attrs, initialize with default settings and return it.

Return

The allocated new workqueue_attrs on success. NULL on failure.

int init_worker_pool(struct worker_pool *pool)

initialize a newly zalloc’d worker_pool

Parameters

struct worker_pool *pool

worker_pool to initialize

Description

Initialize a newly zalloc’d pool. It also allocates pool->attrs.

Return

0 on success, -errno on failure. Even on failure, all fields inside pool proper are initialized and put_unbound_pool() can be called on pool safely to release it.

void put_unbound_pool(struct worker_pool *pool)

put a worker_pool

Parameters

struct worker_pool *pool

worker_pool to put

Description

Put pool. If its refcnt reaches zero, it gets destroyed in RCU safe manner. get_unbound_pool() calls this function on its failure path and this function should be able to release pools which went through, successfully or not, init_worker_pool().

Should be called with wq_pool_mutex held.

struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)

get a worker_pool with the specified attributes

Parameters

const struct workqueue_attrs *attrs

the attributes of the worker_pool to get

Description

Obtain a worker_pool which has the same attributes as attrs, bump the reference count and return it. If there already is a matching worker_pool, it will be used; otherwise, this function attempts to create a new one.

Should be called with wq_pool_mutex held.

Return

On success, a worker_pool with the same attributes as attrs. On failure, NULL.

void wq_calc_pod_cpumask(struct workqueue_attrs *attrs, int cpu)

calculate a wq_attrs’ cpumask for a pod

Parameters

struct workqueue_attrs *attrs

the wq_attrs of the default pwq of the target workqueue

int cpu

the target CPU

Description

Calculate the cpumask a workqueue with attrs should use on pod. The result is stored in attrs->__pod_cpumask.

If pod affinity is not enabled, attrs->cpumask is always used. If enabled and pod has online CPUs requested by attrs, the returned cpumask is the intersection of the possible CPUs of pod and attrs->cpumask.

The caller is responsible for ensuring that the cpumask of pod stays stable.

int apply_workqueue_attrs(struct workqueue_struct *wq, const struct workqueue_attrs *attrs)

apply new workqueue_attrs to an unbound workqueue

Parameters

struct workqueue_struct *wq

the target workqueue

const struct workqueue_attrs *attrs

the workqueue_attrs to apply, allocated with alloc_workqueue_attrs()

Description

Apply attrs to an unbound workqueue wq. Unless disabled, this function maps a separate pwq to each CPU pod with possible CPUs in attrs->cpumask so that work items are affine to the pod they were issued on. Older pwqs are released as in-flight work items finish. Note that a work item which repeatedly requeues itself back-to-back will stay on its current pwq.

Performs GFP_KERNEL allocations.

Return

0 on success and -errno on failure.
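
A sketch of restricting an unbound workqueue to a caller-provided cpumask; pin_unbound_wq() is a hypothetical helper, and the cpus_read_lock() is taken on the assumption that the caller does not already hold it and that the kernel expects CPU hotplug to be excluded around the call:

    #include <linux/cpu.h>
    #include <linux/cpumask.h>
    #include <linux/workqueue.h>

    static int pin_unbound_wq(struct workqueue_struct *wq,
                              const struct cpumask *mask)
    {
            struct workqueue_attrs *attrs;
            int ret;

            attrs = alloc_workqueue_attrs();
            if (!attrs)
                    return -ENOMEM;

            cpumask_copy(attrs->cpumask, mask);

            cpus_read_lock();               /* keep the CPU topology stable */
            ret = apply_workqueue_attrs(wq, attrs);
            cpus_read_unlock();

            free_workqueue_attrs(attrs);
            return ret;
    }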

void unbound_wq_update_pwq(struct workqueue_struct *wq, int cpu)

update a pwq slot for CPU hot[un]plug

Parameters

struct workqueue_struct *wq

the target workqueue

int cpu

the CPU to update the pwq slot for

Description

This function is to be called from CPU_DOWN_PREPARE, CPU_ONLINE and CPU_DOWN_FAILED. cpu is in the same pod as the CPU being hot[un]plugged.

If pod affinity can’t be adjusted due to memory allocation failure, it falls back to wq->dfl_pwq which may not be optimal but is always correct.

Note that when the last allowed CPU of a pod goes offline for a workqueue with a cpumask spanning multiple pods, the workers which were already executing the work items for the workqueue will lose their CPU affinity and may execute on any CPU. This is similar to how per-cpu workqueues behave on CPU_DOWN. If a workqueue user wants strict affinity, it’s the user’s responsibility to flush the work item from CPU_DOWN_PREPARE.

void wq_adjust_max_active(struct workqueue_struct *wq)

update a wq’s max_active to the current setting

Parameters

struct workqueue_struct *wq

target workqueue

Description

If wq isn’t freezing, set wq->max_active to the saved_max_active and activate inactive work items accordingly. If wq is freezing, clear wq->max_active to zero.

void destroy_workqueue(struct workqueue_struct *wq)

safely terminate a workqueue

Parameters

struct workqueue_struct *wq

target workqueue

Description

Safely destroy a workqueue. All work currently pending will be done first.

This function does NOT guarantee that non-pending work that has been submitted with queue_delayed_work() and similar functions will be done before destroying the workqueue. The fundamental problem is that, currently, the workqueue has no way of accessing non-pending delayed_work. delayed_work is only linked on the timer-side. All delayed_work must, therefore, be canceled before calling this function.

TODO: It would be better if the problem described above wouldn’t exist and destroy_workqueue() would cleanly cancel all pending and non-pending delayed_work.
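
A sketch of the shutdown ordering this implies: delayed work is canceled by the caller before the workqueue is destroyed. my_wq, my_dwork and my_exit() are assumptions:

    #include <linux/workqueue.h>

    static struct workqueue_struct *my_wq;      /* from alloc_workqueue() */
    static struct delayed_work my_dwork;

    static void my_exit(void)
    {
            /* destroy_workqueue() can't see delayed work still on the timer
             * side, so cancel it explicitly first */
            cancel_delayed_work_sync(&my_dwork);
            destroy_workqueue(my_wq);
    }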

void workqueue_set_max_active(struct workqueue_struct *wq, int max_active)

adjust max_active of a workqueue

Parameters

struct workqueue_struct *wq

target workqueue

int max_active

new max_active value.

Description

Set max_active of wq to max_active. See the alloc_workqueue() function comment.

Context

Don’t call from IRQ context.

void workqueue_set_min_active(struct workqueue_struct *wq, int min_active)

adjust min_active of an unbound workqueue

Parameters

struct workqueue_struct *wq

target unbound workqueue

int min_active

new min_active value

Description

Set min_active of an unbound workqueue. Unlike other types of workqueues, an unbound workqueue is not guaranteed to be able to process max_active interdependent work items. Instead, an unbound workqueue is guaranteed to be able to process min_active number of interdependent work items which is WQ_DFL_MIN_ACTIVE by default.

Use this function to adjust the min_active value between 0 and the current max_active.

struct work_struct *current_work(void)

retrieve current task’s work struct

Parameters

void

no arguments

Description

Determine if current task is a workqueue worker and what it’s working on. Useful to find out the context that the current task is running in.

Return

work struct if current task is a workqueue worker, NULL otherwise.
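
A sketch of using the return value to detect whether the caller is running from a particular work item; struct my_dev, io_work and called_from_io_work() are assumptions:

    #include <linux/workqueue.h>

    struct my_dev {
            struct work_struct io_work;
    };

    static bool called_from_io_work(struct my_dev *dev)
    {
            /* true only when executing dev->io_work on a kworker */
            return current_work() == &dev->io_work;
    }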

bool current_is_workqueue_rescuer(void)

is current workqueue rescuer?

Parameters

void

no arguments

Description

Determine whether current is a workqueue rescuer. Can be used from work functions to determine whether it’s being run off the rescuer task.

Return

true if current is a workqueue rescuer. false otherwise.

bool workqueue_congested(int cpu, struct workqueue_struct *wq)

test whether a workqueue is congested

Parameters

int cpu

CPU in question

struct workqueue_struct *wq

target workqueue

Description

Test whether wq’s cpu workqueue for cpu is congested. There is no synchronization around this function and the test result is unreliable and only useful as advisory hints or for debugging.

If cpu is WORK_CPU_UNBOUND, the test is performed on the local CPU.

With the exception of ordered workqueues, all workqueues have per-cpu pool_workqueues, each with its own congested state. A workqueue being congested on one CPU doesn’t mean that the workqueue is congested on any other CPUs.

Return

true if congested, false otherwise.

unsigned int work_busy(struct work_struct *work)

test whether a work is currently pending or running

Parameters

struct work_struct *work

the work to be tested

Description

Test whether work is currently pending or running. There is no synchronization around this function and the test result is unreliable and only useful as advisory hints or for debugging.

Return

OR’d bitmask of WORK_BUSY_* bits.

void set_worker_desc(const char *fmt, ...)

set description for the current work item

Parameters

const char *fmt

printf-style format string

...

arguments for the format string

Description

This function can be called by a running work function to describe what the work item is about. If the worker task gets dumped, this information will be printed out together to help debugging. The description can be at most WORKER_DESC_LEN including the trailing ‘\0’.
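
A sketch of a work function annotating itself so that a task dump of the kworker shows what it was doing; struct my_dev and its name field are assumptions:

    #include <linux/kernel.h>
    #include <linux/workqueue.h>

    struct my_dev {
            struct work_struct io_work;
            const char *name;
    };

    static void my_io_work_fn(struct work_struct *work)
    {
            struct my_dev *dev = container_of(work, struct my_dev, io_work);

            set_worker_desc("mydev %s: io flush", dev->name);
            /* ... do the work ... */
    }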

void print_worker_info(const char *log_lvl, struct task_struct *task)

print out worker information and description

Parameters

const char *log_lvl

the log level to use when printing

struct task_struct *task

target task

Description

If task is a worker and currently executing a work item, print out the name of the workqueue being serviced and the worker description set with set_worker_desc() by the currently executing work item.

This function can be safely called on any task as long as the task_struct itself is accessible. While safe, this function isn’t synchronized and may print out mixed-up or garbled information of limited length.

void show_one_workqueue(struct workqueue_struct *wq)

dump state of specified workqueue

Parameters

struct workqueue_struct *wq

workqueue whose state will be printed

void show_one_worker_pool(struct worker_pool *pool)

dump state of specified worker pool

Parameters

struct worker_pool *pool

worker pool whose state will be printed

void show_all_workqueues(void)

dump workqueue state

Parameters

void

no arguments

Description

Called from a sysrq handler and prints out all busy workqueues and pools.

void show_freezable_workqueues(void)

dump freezable workqueue state

Parameters

void

no arguments

Description

Called from try_to_freeze_tasks() and prints out all freezable workqueues still busy.

void rebind_workers(struct worker_pool *pool)

rebind all workers of a pool to the associated CPU

Parameters

struct worker_pool *pool

pool of interest

Description

pool->cpu is coming online. Rebind all workers to the CPU.

void restore_unbound_workers_cpumask(struct worker_pool *pool, int cpu)

restore cpumask of unbound workers

Parameters

struct worker_pool *pool

unbound pool of interest

int cpu

the CPU which is coming up

Description

An unbound pool may end up with a cpumask which doesn’t have any online CPUs. When a worker of such a pool gets scheduled, the scheduler resets its cpus_allowed. If cpu is in pool’s cpumask, which didn’t have any online CPU before, cpus_allowed of all its workers should be restored.

long work_on_cpu_key(int cpu, long (*fn)(void *), void *arg, struct lock_class_key *key)

run a function in thread context on a particular cpu

Parameters

int cpu

the cpu to run on

long (*fn)(void *)

the function to run

void *arg

the function arg

struct lock_class_key *key

The lock class key for lock debugging purposes

Description

It is up to the caller to ensure that the cpu doesn’t go offline. The caller must not hold any locks which would prevent fn from completing.

Return

The value fn returns.
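
Callers normally go through the work_on_cpu() wrapper, which supplies the lock class key. A sketch, where read_hw_state(), read_hw_state_on() and the way the CPU is chosen and kept online are assumptions:

    #include <linux/workqueue.h>

    static long read_hw_state(void *arg)
    {
            /* executes on a kworker bound to the chosen CPU */
            return 0;
    }

    static long read_hw_state_on(int cpu)
    {
            /* the caller must keep cpu online for the duration */
            return work_on_cpu(cpu, read_hw_state, NULL);
    }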

void freeze_workqueues_begin(void)

begin freezing workqueues

Parameters

void

no arguments

Description

Start freezing workqueues. After this function returns, all freezable workqueues will queue new works to their inactive_works list instead of pool->worklist.

Context

Grabs and releases wq_pool_mutex, wq->mutex and pool->lock’s.

bool freeze_workqueues_busy(void)

are freezable workqueues still busy?

Parameters

void

no arguments

Description

Check whether freezing is complete. This function must be called between freeze_workqueues_begin() and thaw_workqueues().

Context

Grabs and releases wq_pool_mutex.

Return

true if some freezable workqueues are still busy. false if freezing is complete.

void thaw_workqueues(void)

thaw workqueues

Parameters

void

no arguments

Description

Thaw workqueues. Normal queueing is restored and all collected frozen works are transferred to their respective pool worklists.

Context

Grabs and releases wq_pool_mutex, wq->mutex and pool->lock’s.

int workqueue_unbound_exclude_cpumask(cpumask_var_t exclude_cpumask)

Exclude given CPUs from unbound cpumask

Parameters

cpumask_var_t exclude_cpumask

the cpumask to be excluded from wq_unbound_cpumask

Description

This function can be called from cpuset code to provide a set of isolated CPUs that should be excluded from wq_unbound_cpumask.

int workqueue_set_unbound_cpumask(cpumask_var_t cpumask)

Set the low-level unbound cpumask

Parameters

cpumask_var_t cpumask

the cpumask to set

Description

The low-level workqueues cpumask is a global cpumask that limits the affinity of all unbound workqueues. This function checks cpumask, applies it to all unbound workqueues and updates all their pwqs.

Return

0 - Success

-EINVAL - Invalid cpumask

-ENOMEM - Failed to allocate memory for attrs or pwqs.

int workqueue_sysfs_register(struct workqueue_struct *wq)

make a workqueue visible in sysfs

Parameters

struct workqueue_struct *wq

the workqueue to register

Description

Expose wq in sysfs under /sys/bus/workqueue/devices. alloc_workqueue*() automatically calls this function if WQ_SYSFS is set, which is the preferred method.

Workqueue user should use this function directly iff it wants to apply workqueue_attrs before making the workqueue visible in sysfs; otherwise, apply_workqueue_attrs() may race against userland updating the attributes.

Return

0 on success, -errno on failure.

void workqueue_sysfs_unregister(struct workqueue_struct *wq)

undo workqueue_sysfs_register()

Parameters

struct workqueue_struct *wq

the workqueue to unregister

Description

If wq is registered to sysfs by workqueue_sysfs_register(), unregister.

void workqueue_init_early(void)

early init for workqueue subsystem

Parameters

void

no arguments

Description

This is the first step of three-staged workqueue subsystem initialization and invoked as soon as the bare basics - memory allocation, cpumasks and idr are up. It sets up all the data structures and system workqueues and allows early boot code to create workqueues and queue/cancel work items. Actual work item execution starts only after kthreads can be created and scheduled right before early initcalls.

void workqueue_init(void)

bring workqueue subsystem fully online

Parameters

void

no arguments

Description

This is the second step of three-staged workqueue subsystem initialization and invoked as soon as kthreads can be created and scheduled. Workqueues have been created and work items queued on them, but there are no kworkers executing the work items yet. Populate the worker pools with the initial workers and enable future kworker creations.

void workqueue_init_topology(void)

initialize CPU pods for unbound workqueues

Parameters

void

no arguments

Description

This is the third step of three-staged workqueue subsystem initialization and invoked after SMP and topology information are fully initialized. It initializes the unbound CPU pods accordingly.