The Linux Kernel API

List Management Functions

void INIT_LIST_HEAD(struct list_head * list)

Initialize a list_head structure

Parameters

struct list_head *list
list_head structure to be initialized.

Description

Initializes the list_head to point to itself. If it is a list header, the result is an empty list.

void list_add(struct list_head * new, struct list_head * head)

add a new entry

Parameters

struct list_head *new
new entry to be added
struct list_head *head
list head to add it after

Description

Insert a new entry after the specified head. This is good for implementing stacks.

void list_add_tail(struct list_head * new, struct list_head * head)

add a new entry

Parameters

struct list_head *new
new entry to be added
struct list_head *head
list head to add it before

Description

Insert a new entry before the specified head. This is useful for implementing queues.
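
As a quick illustration of the stack/queue distinction, here is a minimal sketch (the struct, list and function names are hypothetical, not part of the API):

    #include <linux/list.h>
    #include <linux/slab.h>

    struct item {
        int value;
        struct list_head node;          /* embedded linkage */
    };

    static LIST_HEAD(my_list);          /* statically initialized empty list */

    static void example_add(void)
    {
        struct item *a = kmalloc(sizeof(*a), GFP_KERNEL);
        struct item *b = kmalloc(sizeof(*b), GFP_KERNEL);

        if (!a || !b)
            return;                         /* error handling elided */

        a->value = 1;
        b->value = 2;
        list_add(&a->node, &my_list);       /* push: a is now the first entry */
        list_add_tail(&b->node, &my_list);  /* enqueue: b is now the last entry */
    }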

void list_del(struct list_head * entry)

deletes entry from list.

Parameters

struct list_head *entry
the element to delete from the list.

Note

list_empty() on entry does not return true after this; the entry is in an undefined state.

void list_replace(struct list_head * old, struct list_head * new)

replace old entry by new one

Parameters

struct list_head *old
the element to be replaced
struct list_head *new
the new element to insert

Description

If old was empty, it will be overwritten.

void list_replace_init(struct list_head * old, struct list_head * new)

replace old entry by new one and initialize the old one

Parameters

struct list_head *old
the element to be replaced
struct list_head *new
the new element to insert

Description

If old was empty, it will be overwritten.

void list_swap(struct list_head * entry1, struct list_head * entry2)

replace entry1 with entry2 and re-add entry1 at entry2’s position

Parameters

struct list_head *entry1
the location to place entry2
struct list_head *entry2
the location to place entry1
void list_del_init(struct list_head * entry)

deletes entry from list and reinitialize it.

Parameters

struct list_head *entry
the element to delete from the list.
void list_move(struct list_head * list, struct list_head * head)

delete from one list and add as another’s head

Parameters

struct list_head *list
the entry to move
struct list_head *head
the head that will precede our entry
void list_move_tail(struct list_head * list, struct list_head * head)

delete from one list and add as another’s tail

Parameters

struct list_head *list
the entry to move
struct list_head *head
the head that will follow our entry
void list_bulk_move_tail(struct list_head * head, struct list_head * first, struct list_head * last)

move a subsection of a list to its tail

Parameters

struct list_head *head
the head that will follow our entry
struct list_head *first
first entry to move
struct list_head *last
last entry to move, can be the same as first

Description

Move all entries between first and including last before head. All three entries must belong to the same linked list.

int list_is_first(const struct list_head * list, const struct list_head * head)

tests whether list is the first entry in list head

Parameters

const struct list_head *list
the entry to test
const struct list_head *head
the head of the list
int list_is_last(const struct list_head * list, const struct list_head * head)

tests whether list is the last entry in list head

Parameters

const struct list_head *list
the entry to test
const struct list_head *head
the head of the list
int list_empty(const struct list_head * head)

tests whether a list is empty

Parameters

const struct list_head *head
the list to test.
void list_del_init_careful(struct list_head * entry)

deletes entry from list and reinitialize it.

Parameters

struct list_head *entry
the element to delete from the list.

Description

This is the same as list_del_init(), except designed to be used together with list_empty_careful() in a way to guarantee ordering of other memory operations.

Any memory operations done before a list_del_init_careful() are guaranteed to be visible after a list_empty_careful() test.

int list_empty_careful(const struct list_head * head)

tests whether a list is empty and not being modified

Parameters

const struct list_head *head
the list to test

Description

tests whether a list is empty _and_ checks that no other CPU might be in the process of modifying either member (next or prev)

NOTE

using list_empty_careful() without synchronization can only be safe if the only activity that can happen to the list entry is list_del_init(). E.g. it cannot be used if another CPU could re-list_add() it.

void list_rotate_left(struct list_head * head)

rotate the list to the left

Parameters

struct list_head *head
the head of the list
void list_rotate_to_front(struct list_head * list, struct list_head * head)

Rotate list to specific item.

Parameters

struct list_head *list
The desired new front of the list.
struct list_head *head
The head of the list.

Description

Rotates list so that list becomes the new front of the list.

int list_is_singular(const struct list_head * head)

tests whether a list has just one entry.

Parameters

const struct list_head *head
the list to test.
void list_cut_position(struct list_head * list, struct list_head * head, struct list_head * entry)

cut a list into two

Parameters

struct list_head *list
a new list to add all removed entries
struct list_head *head
a list with entries
struct list_head *entry
an entry within head, could be the head itself and if so we won’t cut the list

Description

This helper moves the initial part of head, up to and including entry, from head to list. You should pass in entry an element you know is on head. list should be an empty list or a list you do not care about losing its data.
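
For example, a plausible sketch that moves just the first entry of a populated list src onto a fresh list dst (both names hypothetical):

    LIST_HEAD(dst);                 /* must start empty (or be expendable) */

    /* cut everything up to and including the first entry of 'src' */
    if (!list_empty(&src))
        list_cut_position(&dst, &src, src.next);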

void list_cut_before(struct list_head * list, struct list_head * head, struct list_head * entry)

cut a list into two, before given entry

Parameters

struct list_head *list
a new list to add all removed entries
struct list_head *head
a list with entries
struct list_head *entry
an entry within head, could be the head itself

Description

This helper moves the initial part of head, up to but excluding entry, from head to list. You should pass in entry an element you know is on head. list should be an empty list or a list you do not care about losing its data. If entry == head, all entries on head are moved to list.

void list_splice(const struct list_head * list, struct list_head * head)

join two lists, this is designed for stacks

Parameters

const struct list_head *list
the new list to add.
struct list_head *head
the place to add it in the first list.
void list_splice_tail(struct list_head * list, struct list_head * head)

join two lists, each list being a queue

Parameters

struct list_head *list
the new list to add.
struct list_head *head
the place to add it in the first list.
void list_splice_init(struct list_head * list, struct list_head * head)

join two lists and reinitialise the emptied list.

Parameters

struct list_head *list
the new list to add.
struct list_head *head
the place to add it in the first list.

Description

The list at list is reinitialised

void list_splice_tail_init(struct list_head * list, struct list_head * head)

join two lists and reinitialise the emptied list

Parameters

struct list_head *list
the new list to add.
struct list_head *head
the place to add it in the first list.

Description

Each of the lists is a queue. The list at list is reinitialised

list_entry(ptr, type, member)

get the struct for this entry

Parameters

ptr
the struct list_head pointer.
type
the type of the struct this is embedded in.
member
the name of the list_head within the struct.
list_first_entry(ptr, type, member)

get the first element from a list

Parameters

ptr
the list head to take the element from.
type
the type of the struct this is embedded in.
member
the name of the list_head within the struct.

Description

Note that the list is expected to be non-empty.

list_last_entry(ptr, type, member)

get the last element from a list

Parameters

ptr
the list head to take the element from.
type
the type of the struct this is embedded in.
member
the name of the list_head within the struct.

Description

Note that the list is expected to be non-empty.

list_first_entry_or_null(ptr, type, member)

get the first element from a list

Parameters

ptr
the list head to take the element from.
type
the type of the struct this is embedded in.
member
the name of the list_head within the struct.

Description

Note that if the list is empty, it returns NULL.
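
A NULL-safe peek at the head of a list, reusing the hypothetical struct item from the earlier sketch:

    struct item *first;

    first = list_first_entry_or_null(&my_list, struct item, node);
    if (first)                      /* NULL when my_list is empty */
        pr_info("first value: %d\n", first->value);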

list_next_entry(pos, member)

get the next element in list

Parameters

pos
the type * to cursor
member
the name of the list_head within the struct.
list_prev_entry(pos, member)

get the prev element in list

Parameters

pos
the type * to cursor
member
the name of the list_head within the struct.
list_for_each(pos, head)

iterate over a list

Parameters

pos
the struct list_head to use as a loop cursor.
head
the head for your list.
list_for_each_continue(pos, head)

continue iteration over a list

Parameters

pos
the struct list_head to use as a loop cursor.
head
the head for your list.

Description

Continue to iterate over a list, continuing after the current position.

list_for_each_prev(pos, head)

iterate over a list backwards

Parameters

pos
the struct list_head to use as a loop cursor.
head
the head for your list.
list_for_each_safe(pos, n, head)

iterate over a list safe against removal of list entry

Parameters

pos
the struct list_head to use as a loop cursor.
n
another struct list_head to use as temporary storage
head
the head for your list.
list_for_each_prev_safe(pos, n, head)

iterate over a list backwards safe against removal of list entry

Parameters

pos
the struct list_head to use as a loop cursor.
n
another struct list_head to use as temporary storage
head
the head for your list.
list_for_each_entry(pos, head, member)

iterate over list of given type

Parameters

pos
the type * to use as a loop cursor.
head
the head for your list.
member
the name of the list_head within the struct.
list_for_each_entry_reverse(pos, head, member)

iterate backwards over list of given type.

Parameters

pos
the type * to use as a loop cursor.
head
the head for your list.
member
the name of the list_head within the struct.
list_prepare_entry(pos, head, member)

prepare a pos entry for use in list_for_each_entry_continue()

Parameters

pos
the type * to use as a start point
head
the head of the list
member
the name of the list_head within the struct.

Description

Prepares a pos entry for use as a start point in list_for_each_entry_continue().

list_for_each_entry_continue(pos, head, member)

continue iteration over list of given type

Parameters

pos
the type * to use as a loop cursor.
head
the head for your list.
member
the name of the list_head within the struct.

Description

Continue to iterate over list of given type, continuing after the current position.

list_for_each_entry_continue_reverse(pos, head, member)

iterate backwards from the given point

Parameters

pos
the type * to use as a loop cursor.
head
the head for your list.
member
the name of the list_head within the struct.

Description

Start to iterate over list of given type backwards, continuing after the current position.

list_for_each_entry_from(pos, head, member)

iterate over list of given type from the current point

Parameters

pos
the type * to use as a loop cursor.
head
the head for your list.
member
the name of the list_head within the struct.

Description

Iterate over list of given type, continuing from current position.

list_for_each_entry_from_reverse(pos, head, member)

iterate backwards over list of given type from the current point

Parameters

pos
the type * to use as a loop cursor.
head
the head for your list.
member
the name of the list_head within the struct.

Description

Iterate backwards over list of given type, continuing from current position.

list_for_each_entry_safe(pos, n, head, member)

iterate over list of given type safe against removal of list entry

Parameters

pos
the type * to use as a loop cursor.
n
another type * to use as temporary storage
head
the head for your list.
member
the name of the list_head within the struct.
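
Because the _safe variant caches the next entry in n, the current entry may be unlinked and freed inside the loop body. A minimal drain-the-list sketch, again using the hypothetical struct item from the first sketch:

    struct item *pos, *n;

    list_for_each_entry_safe(pos, n, &my_list, node) {
        list_del(&pos->node);   /* safe: 'n' already points at the next entry */
        kfree(pos);
    }
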
list_for_each_entry_safe_continue(pos, n, head, member)

continue list iteration safe against removal

Parameters

pos
the type * to use as a loop cursor.
n
another type * to use as temporary storage
head
the head for your list.
member
the name of the list_head within the struct.

Description

Iterate over list of given type, continuing after current point, safe against removal of list entry.

list_for_each_entry_safe_from(pos, n, head, member)

iterate over list from current point safe against removal

Parameters

pos
the type * to use as a loop cursor.
n
another type * to use as temporary storage
head
the head for your list.
member
the name of the list_head within the struct.

Description

Iterate over list of given type from current point, safe against removal of list entry.

list_for_each_entry_safe_reverse(pos, n, head, member)

iterate backwards over list safe against removal

Parameters

pos
the type * to use as a loop cursor.
n
another type * to use as temporary storage
head
the head for your list.
member
the name of the list_head within the struct.

Description

Iterate backwards over list of given type, safe against removal of list entry.

list_safe_reset_next(pos, n, member)

reset a stale list_for_each_entry_safe loop

Parameters

pos
the loop cursor used in the list_for_each_entry_safe loop
n
temporary storage used in list_for_each_entry_safe
member
the name of the list_head within the struct.

Description

list_safe_reset_next is not safe to use in general if the list may be modified concurrently (e.g. the lock is dropped in the loop body). An exception to this is if the cursor element (pos) is pinned in the list, and list_safe_reset_next is called after re-taking the lock and before completing the current iteration of the loop body.

int hlist_unhashed(const struct hlist_node * h)

Has node been removed from list and reinitialized?

Parameters

const struct hlist_node *h
Node to be checked

Description

Note that not all removal functions will leave a node in unhashed state. For example, hlist_nulls_del_init_rcu() does leave the node in unhashed state, but hlist_nulls_del() does not.

int hlist_unhashed_lockless(const struct hlist_node * h)

Version of hlist_unhashed for lockless use

Parameters

const struct hlist_node *h
Node to be checked

Description

This variant of hlist_unhashed() must be used in lockless contexts to avoid potential load-tearing. The READ_ONCE() is paired with the various WRITE_ONCE() in hlist helpers that are defined below.

int hlist_empty(const struct hlist_head * h)

Is the specified hlist_head structure an empty hlist?

Parameters

const struct hlist_head *h
Structure to check.
void hlist_del(struct hlist_node * n)

Delete the specified hlist_node from its list

Parameters

struct hlist_node *n
Node to delete.

Description

Note that this function leaves the node in hashed state. Use hlist_del_init() or similar instead to unhash n.

void hlist_del_init(struct hlist_node * n)

Delete the specified hlist_node from its list and initialize

Parameters

struct hlist_node *n
Node to delete.

Description

Note that this function leaves the node in unhashed state.

void hlist_add_head(struct hlist_node * n, struct hlist_head * h)

add a new entry at the beginning of the hlist

Parameters

struct hlist_node *n
new entry to be added
struct hlist_head *h
hlist head to add it after

Description

Insert a new entry after the specified head. This is good for implementing stacks.
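
An hlist_head is a single pointer, half the size of a list_head, which is why hlists are the usual choice for hash table buckets. A minimal sketch of such a table (bucket count, struct and names are hypothetical):

    #include <linux/list.h>
    #include <linux/hash.h>

    #define OBJ_HASH_BITS 4

    /* 16 buckets; a zero-filled hlist_head is a valid empty bucket */
    static struct hlist_head obj_table[1 << OBJ_HASH_BITS];

    struct obj {
        unsigned long key;
        struct hlist_node link;
    };

    static void obj_insert(struct obj *o)
    {
        hlist_add_head(&o->link, &obj_table[hash_long(o->key, OBJ_HASH_BITS)]);
    }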

void hlist_add_before(struct hlist_node * n, struct hlist_node * next)

add a new entry before the one specified

Parameters

struct hlist_node *n
new entry to be added
struct hlist_node *next
hlist node to add it before, which must be non-NULL
void hlist_add_behind(struct hlist_node * n, struct hlist_node * prev)

add a new entry after the one specified

Parameters

struct hlist_node *n
new entry to be added
struct hlist_node *prev
hlist node to add it after, which must be non-NULL
void hlist_add_fake(struct hlist_node * n)

create a fake hlist consisting of a single headless node

Parameters

struct hlist_node *n
Node to make a fake list out of

Description

This makes n appear to be its own predecessor on a headless hlist. The point of this is to allow things like hlist_del() to work correctly in cases where there is no list.

bool hlist_fake(struct hlist_node * h)

Is this node a fake hlist?

Parameters

struct hlist_node *h
Node to check for being a self-referential fake hlist.
bool hlist_is_singular_node(struct hlist_node * n, struct hlist_head * h)

is node the only element of the specified hlist?

Parameters

struct hlist_node *n
Node to check for singularity.
struct hlist_head *h
Header for potentially singular list.

Description

Check whether the node is the only node of the head without accessing head, thus avoiding unnecessary cache misses.

void hlist_move_list(struct hlist_head * old, struct hlist_head * new)

Move an hlist

Parameters

struct hlist_head *old
hlist_head for old list.
struct hlist_head *new
hlist_head for new list.

Description

Move a list from one list head to another. Fixup the pprev reference of the first entry if it exists.

hlist_for_each_entry(pos, head, member)

iterate over list of given type

Parameters

pos
the type * to use as a loop cursor.
head
the head for your list.
member
the name of the hlist_node within the struct.
hlist_for_each_entry_continue(pos, member)

iterate over a hlist continuing after current point

Parameters

pos
the type * to use as a loop cursor.
member
the name of the hlist_node within the struct.
hlist_for_each_entry_from(pos, member)

iterate over a hlist continuing from current point

Parameters

pos
the type * to use as a loop cursor.
member
the name of the hlist_node within the struct.
hlist_for_each_entry_safe(pos, n, head, member)

iterate over list of given type safe against removal of list entry

Parameters

pos
the type * to use as a loop cursor.
n
a struct hlist_node to use as temporary storage
head
the head for your list.
member
the name of the hlist_node within the struct.

Basic C Library Functions

When writing drivers, you cannot in general use routines which are from the C Library. Some of the functions have been found generally useful and they are listed below. The behaviour of these functions may vary slightly from those defined by ANSI, and these deviations are noted in the text.

String Conversions

unsigned long long simple_strtoull(const char * cp, char ** endp, unsigned int base)

convert a string to an unsigned long long

Parameters

const char *cp
The start of the string
char **endp
A pointer to the end of the parsed string will be placed here
unsigned int base
The number base to use

Description

This function has caveats. Please use kstrtoull instead.

unsigned long simple_strtoul(const char * cp, char ** endp, unsigned int base)

convert a string to an unsigned long

Parameters

const char *cp
The start of the string
char **endp
A pointer to the end of the parsed string will be placed here
unsigned int base
The number base to use

Description

This function has caveats. Please use kstrtoul instead.

long simple_strtol(const char * cp, char ** endp, unsigned int base)

convert a string to a signed long

Parameters

const char *cp
The start of the string
char **endp
A pointer to the end of the parsed string will be placed here
unsigned int base
The number base to use

Description

This function has caveats. Please use kstrtol instead.

long long simple_strtoll(const char * cp, char ** endp, unsigned int base)

convert a string to a signed long long

Parameters

const char *cp
The start of the string
char **endp
A pointer to the end of the parsed string will be placed here
unsigned int base
The number base to use

Description

This function has caveats. Please use kstrtoll instead.

int vsnprintf(char * buf, size_t size, const char * fmt, va_list args)

Format a string and place it in a buffer

Parameters

char *buf
The buffer to place the result into
size_t size
The size of the buffer, including the trailing null space
const char *fmt
The format string to use
va_list args
Arguments for the format string

Description

This function generally follows C99 vsnprintf, but has some extensions and a few limitations:

  • %n is unsupported
  • %p* is handled by pointer()

See pointer() or Documentation/core-api/printk-formats.rst for more extensive description.

Please update the documentation in both places when making changes

The return value is the number of characters which would be generated for the given input, excluding the trailing ‘\0’, as per ISO C99. If you want to have the exact number of characters written into buf as return value (not including the trailing ‘\0’), use vscnprintf(). If the return is greater than or equal to size, the resulting string is truncated.

If you’re not already dealing with a va_list consider using snprintf().

int vscnprintf(char * buf, size_t size, const char * fmt, va_list args)

Format a string and place it in a buffer

Parameters

char *buf
The buffer to place the result into
size_t size
The size of the buffer, including the trailing null space
const char *fmt
The format string to use
va_list args
Arguments for the format string

Description

The return value is the number of characters which have been written into buf not including the trailing ‘\0’. If size is == 0 the function returns 0.

If you’re not already dealing with a va_list consider using scnprintf().

See the vsnprintf() documentation for format string extensions over C99.

int snprintf(char * buf, size_t size, const char * fmt, ...)

Format a string and place it in a buffer

Parameters

char *buf
The buffer to place the result into
size_t size
The size of the buffer, including the trailing null space
const char *fmt
The format string to use
...
Arguments for the format string

Description

The return value is the number of characters which would be generated for the given input, excluding the trailing null, as per ISO C99. If the return is greater than or equal to size, the resulting string is truncated.

See the vsnprintf() documentation for format string extensions over C99.

int scnprintf(char * buf, size_t size, const char * fmt, ...)

Format a string and place it in a buffer

Parameters

char *buf
The buffer to place the result into
size_t size
The size of the buffer, including the trailing null space
const char *fmt
The format string to use
...
Arguments for the format string

Description

The return value is the number of characters written into buf not including the trailing ‘\0’. If size is == 0 the function returns 0.
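
This property makes scnprintf() the natural choice for incrementally filling a fixed-size buffer; a small sketch:

    char buf[64];
    size_t off = 0;
    int i;

    /* 'off' advances by what was actually written, so it can never
     * run past the end of 'buf', even if the output is truncated. */
    for (i = 0; i < 16; i++)
        off += scnprintf(buf + off, sizeof(buf) - off, "%d ", i);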

int vsprintf(char * buf, const char * fmt, va_list args)

Format a string and place it in a buffer

Parameters

char *buf
The buffer to place the result into
const char *fmt
The format string to use
va_list args
Arguments for the format string

Description

The function returns the number of characters written into buf. Use vsnprintf() or vscnprintf() in order to avoid buffer overflows.

If you’re not already dealing with a va_list consider using sprintf().

See the vsnprintf() documentation for format string extensions over C99.

int sprintf(char * buf, const char * fmt, ...)

Format a string and place it in a buffer

Parameters

char *buf
The buffer to place the result into
const char *fmt
The format string to use
...
Arguments for the format string

Description

The function returns the number of characters written into buf. Use snprintf() or scnprintf() in order to avoid buffer overflows.

See the vsnprintf() documentation for format string extensions over C99.

int vbin_printf(u32 * bin_buf, size_t size, const char * fmt, va_list args)

Parse a format string and place args’ binary value in a buffer

Parameters

u32 *bin_buf
The buffer to place args’ binary value
size_t size
The size of the buffer (in 32-bit words, not characters)
const char *fmt
The format string to use
va_list args
Arguments for the format string

Description

The format follows C99 vsnprintf, except %n is ignored, and its argument is skipped.

The return value is the number of 32-bit words which would be generated for the given input.

NOTE

If the return value is greater than size, the resulting bin_buf is NOT valid for bstr_printf().

int bstr_printf(char * buf, size_t size, const char * fmt, const u32 * bin_buf)

Format a string from binary arguments and place it in a buffer

Parameters

char *buf
The buffer to place the result into
size_t size
The size of the buffer, including the trailing null space
const char *fmt
The format string to use
const u32 *bin_buf
Binary arguments for the format string

Description

This function is like C99 vsnprintf, but the difference is that vsnprintf gets its arguments from the stack, while bstr_printf gets its arguments from bin_buf, a binary buffer generated by vbin_printf.

The format follows C99 vsnprintf, but has some extensions:
see vsnprintf comment for details.

The return value is the number of characters which would be generated for the given input, excluding the trailing ‘\0’, as per ISO C99. If you want to have the exact number of characters written into buf as return value (not including the trailing ‘\0’), use vscnprintf(). If the return is greater than or equal to size, the resulting string is truncated.

int bprintf(u32 * bin_buf, size_t size, const char * fmt, ...)

Parse a format string and place args’ binary value in a buffer

Parameters

u32 *bin_buf
The buffer to place args’ binary value
size_t size
The size of the buffer (in 32-bit words, not characters)
const char *fmt
The format string to use
...
Arguments for the format string

Description

The function returns the number of words (u32) written into bin_buf.

int vsscanf(const char * buf, const char * fmt, va_list args)

Unformat a buffer into a list of arguments

Parameters

const char *buf
input buffer
const char *fmt
format of buffer
va_list args
arguments
int sscanf(const char * buf, const char * fmt, ...)

Unformat a buffer into a list of arguments

Parameters

const char *buf
input buffer
const char *fmt
formatting of buffer
...
resulting arguments
int kstrtol(const char * s, unsigned int base, long * res)

convert a string to a long

Parameters

const char *s
The start of the string. The string must be null-terminated, and may also include a single newline before its terminating null. The first character may also be a plus sign or a minus sign.
unsigned int base
The number base to use. The maximum supported base is 16. If base is given as 0, then the base of the string is automatically detected with the conventional semantics - If it begins with 0x the number will be parsed as a hexadecimal (case insensitive), if it otherwise begins with 0, it will be parsed as an octal number. Otherwise it will be parsed as a decimal.
long *res
Where to write the result of the conversion on success.

Description

Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. Preferred over simple_strtol(). Return code must be checked.
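
A typical call, e.g. when parsing user input from sysfs (the buffer name is hypothetical):

    long val;
    int ret;

    ret = kstrtol(buf, 10, &val);   /* a trailing newline in 'buf' is accepted */
    if (ret)
        return ret;                 /* -ERANGE or -EINVAL */
    /* use 'val' ... */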

int kstrtoul(const char * s, unsigned int base, unsigned long * res)

convert a string to an unsigned long

Parameters

const char *s
The start of the string. The string must be null-terminated, and may also include a single newline before its terminating null. The first character may also be a plus sign, but not a minus sign.
unsigned int base
The number base to use. The maximum supported base is 16. If base is given as 0, then the base of the string is automatically detected with the conventional semantics - If it begins with 0x the number will be parsed as a hexadecimal (case insensitive), if it otherwise begins with 0, it will be parsed as an octal number. Otherwise it will be parsed as a decimal.
unsigned long *res
Where to write the result of the conversion on success.

Description

Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. Preferred over simple_strtoul(). Return code must be checked.

int kstrtoull(const char * s, unsigned int base, unsigned long long * res)

convert a string to an unsigned long long

Parameters

const char *s
The start of the string. The string must be null-terminated, and may also include a single newline before its terminating null. The first character may also be a plus sign, but not a minus sign.
unsigned int base
The number base to use. The maximum supported base is 16. If base is given as 0, then the base of the string is automatically detected with the conventional semantics - If it begins with 0x the number will be parsed as a hexadecimal (case insensitive), if it otherwise begins with 0, it will be parsed as an octal number. Otherwise it will be parsed as a decimal.
unsigned long long *res
Where to write the result of the conversion on success.

Description

Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. Preferred over simple_strtoull(). Return code must be checked.

int kstrtoll(const char * s, unsigned int base, long long * res)

convert a string to a long long

Parameters

const char *s
The start of the string. The string must be null-terminated, and may also include a single newline before its terminating null. The first character may also be a plus sign or a minus sign.
unsigned int base
The number base to use. The maximum supported base is 16. If base is given as 0, then the base of the string is automatically detected with the conventional semantics - If it begins with 0x the number will be parsed as a hexadecimal (case insensitive), if it otherwise begins with 0, it will be parsed as an octal number. Otherwise it will be parsed as a decimal.
long long *res
Where to write the result of the conversion on success.

Description

Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. Preferred over simple_strtoll(). Return code must be checked.

int kstrtouint(const char * s, unsigned int base, unsigned int * res)

convert a string to an unsigned int

Parameters

const char *s
The start of the string. The string must be null-terminated, and may also include a single newline before its terminating null. The first character may also be a plus sign, but not a minus sign.
unsigned int base
The number base to use. The maximum supported base is 16. If base is given as 0, then the base of the string is automatically detected with the conventional semantics - If it begins with 0x the number will be parsed as a hexadecimal (case insensitive), if it otherwise begins with 0, it will be parsed as an octal number. Otherwise it will be parsed as a decimal.
unsigned int *res
Where to write the result of the conversion on success.

Description

Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. Preferred over simple_strtoul(). Return code must be checked.

int kstrtoint(const char * s, unsigned int base, int * res)

convert a string to an int

Parameters

const char *s
The start of the string. The string must be null-terminated, and may also include a single newline before its terminating null. The first character may also be a plus sign or a minus sign.
unsigned int base
The number base to use. The maximum supported base is 16. If base is given as 0, then the base of the string is automatically detected with the conventional semantics - If it begins with 0x the number will be parsed as a hexadecimal (case insensitive), if it otherwise begins with 0, it will be parsed as an octal number. Otherwise it will be parsed as a decimal.
int *res
Where to write the result of the conversion on success.

Description

Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. Preferred over simple_strtol(). Return code must be checked.

int kstrtobool(const char * s, bool * res)

convert common user inputs into boolean values

Parameters

const char *s
input string
bool *res
result

Description

This routine returns 0 iff the first character is one of ‘Yy1Nn0’, or [oO][NnFf] for “on” and “off”. Otherwise it will return -EINVAL. Value pointed to by res is updated upon finding a match.
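
A plausible sketch of a sysfs store handler built on kstrtobool() (the attribute and handler names are hypothetical):

    #include <linux/device.h>
    #include <linux/kernel.h>

    static ssize_t enable_store(struct device *dev,
                                struct device_attribute *attr,
                                const char *buf, size_t count)
    {
        bool enable;
        int ret;

        ret = kstrtobool(buf, &enable);  /* accepts "y", "1", "on", "off", ... */
        if (ret)
            return ret;
        /* ... act on 'enable' ... */
        return count;
    }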

void string_get_size(u64 size, u64 blk_size, const enum string_size_units units, char * buf, int len)

get the size in the specified units

Parameters

u64 size
The size to be converted in blocks
u64 blk_size
Size of the block (use 1 for size in bytes)
const enum string_size_units units
units to use (powers of 1000 or 1024)
char *buf
buffer to format to
int len
length of buffer

Description

This function returns a string formatted to 3 significant figures giving the size in the required units. buf should have room for at least 9 bytes and will always be zero terminated.

int string_unescape(char * src, char * dst, size_t size, unsigned int flags)

unquote characters in the given string

Parameters

char *src
source buffer (escaped)
char *dst
destination buffer (unescaped)
size_t size
size of the destination buffer (0 to unlimit)
unsigned int flags
combination of the flags.

Description

The function unquotes characters in the given string.

Because the size of the output will be the same as or less than the size of the input, the transformation may be performed in place.

Caller must provide valid source and destination pointers. Be aware that destination buffer will always be NULL-terminated. Source string must be NULL-terminated as well. The supported flags are:

UNESCAPE_SPACE:
        '\f' - form feed
        '\n' - new line
        '\r' - carriage return
        '\t' - horizontal tab
        '\v' - vertical tab
UNESCAPE_OCTAL:
        '\NNN' - byte with octal value NNN (1 to 3 digits)
UNESCAPE_HEX:
        '\xHH' - byte with hexadecimal value HH (1 to 2 digits)
UNESCAPE_SPECIAL:
        '\"' - double quote
        '\\' - backslash
        '\a' - alert (BEL)
        '\e' - escape
UNESCAPE_ANY:
        all previous together

Return

The amount of the characters processed to the destination buffer excluding the trailing ‘\0’ is returned.

int string_escape_mem(const char * src, size_t isz, char * dst, size_t osz, unsigned int flags, const char * only)

quote characters in the given memory buffer

Parameters

const char *src
source buffer (unescaped)
size_t isz
source buffer size
char *dst
destination buffer (escaped)
size_t osz
destination buffer size
unsigned int flags
combination of the flags
const char *only
NULL-terminated string containing characters used to limit the selected escape class. If characters are included in only that would not normally be escaped by the classes selected in flags, they will be copied to dst unescaped.

Description

The process of escaping a byte buffer consists of several parts. They are applied in the following sequence.

  1. The character is matched to the printable class, if asked, and in case of match it passes through to the output.
  2. The character is not matched to the one from only string and thus must go as-is to the output.
  3. The character is checked if it falls into the class given by flags. ESCAPE_OCTAL and ESCAPE_HEX are going last since they cover any character. Note that they actually can’t go together, otherwise ESCAPE_HEX will be ignored.

Caller must provide valid source and destination pointers. Be aware that the destination buffer will not be NULL-terminated, thus the caller has to append it if needed. The supported flags are:

%ESCAPE_SPACE: (special white space, not space itself)
        '\f' - form feed
        '\n' - new line
        '\r' - carriage return
        '\t' - horizontal tab
        '\v' - vertical tab
%ESCAPE_SPECIAL:
        '\\' - backslash
        '\a' - alert (BEL)
        '\e' - escape
%ESCAPE_NULL:
        '\0' - null
%ESCAPE_OCTAL:
        '\NNN' - byte with octal value NNN (3 digits)
%ESCAPE_ANY:
        all previous together
%ESCAPE_NP:
        escape only non-printable characters (checked by isprint)
%ESCAPE_ANY_NP:
        all previous together
%ESCAPE_HEX:
        '\xHH' - byte with hexadecimal value HH (2 digits)

Return

The total size of the escaped output that would be generated for the given input and flags. To check whether the output was truncated, compare the return value to osz. There is room left in dst for a ‘\0’ terminator if and only if ret < osz.

String Manipulation

int strncasecmp(const char * s1, const char * s2, size_t len)

Case insensitive, length-limited string comparison

Parameters

const char *s1
One string
const char *s2
The other string
size_t len
the maximum number of characters to compare
char *strcpy(char * dest, const char * src)

Copy a NUL terminated string

Parameters

char *dest
Where to copy the string to
const char *src
Where to copy the string from
char *strncpy(char * dest, const char * src, size_t count)

Copy a length-limited, C-string

Parameters

char *dest
Where to copy the string to
const char *src
Where to copy the string from
size_t count
The maximum number of bytes to copy

Description

The result is not NUL-terminated if the source exceeds count bytes.

In the case where the length of src is less than that of count, the remainder of dest will be padded with NUL.

size_t strlcpy(char * dest, const char * src, size_t size)

Copy a C-string into a sized buffer

Parameters

char *dest
Where to copy the string to
const char *src
Where to copy the string from
size_t size
size of destination buffer

Description

Compatible with *BSD: the result is always a valid NUL-terminated string that fits in the buffer (unless, of course, the buffer size is zero). It does not pad out the result like strncpy() does.

ssize_t strscpy(char * dest, const char * src, size_t count)

Copy a C-string into a sized buffer

Parameters

char *dest
Where to copy the string to
const char *src
Where to copy the string from
size_t count
Size of destination buffer

Description

Copy the string, or as much of it as fits, into the dest buffer. The behavior is undefined if the string buffers overlap. The destination buffer is always NUL terminated, unless it’s zero-sized.

Preferred to strlcpy() since the API doesn’t require reading memory from the src string beyond the specified “count” bytes, and since the return value is easier to error-check than strlcpy()’s. In addition, the implementation is robust to the string changing out from underneath it, unlike the current strlcpy() implementation.

Preferred to strncpy() since it always returns a valid string, and doesn’t unnecessarily force the tail of the destination buffer to be zeroed. If zeroing is desired please use strscpy_pad().

Return

  • The number of characters copied (not including the trailing NUL)
  • -E2BIG if count is 0 or src was truncated.
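
A minimal usage sketch (the source pointer is hypothetical); note that on -E2BIG the destination still holds a valid, truncated, NUL-terminated string:

    char name[16];
    ssize_t len;

    len = strscpy(name, src, sizeof(name));
    if (len == -E2BIG)
        pr_warn("name truncated\n");    /* 'name' is still usable */
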
ssize_t strscpy_pad(char * dest, const char * src, size_t count)

Copy a C-string into a sized buffer

Parameters

char *dest
Where to copy the string to
const char *src
Where to copy the string from
size_t count
Size of destination buffer

Description

Copy the string, or as much of it as fits, into the dest buffer. The behavior is undefined if the string buffers overlap. The destination buffer is always NUL terminated, unless it’s zero-sized.

If the source string is shorter than the destination buffer, zeros the tail of the destination buffer.

For full explanation of why you may want to consider using the ‘strscpy’ functions please see the function docstring for strscpy().

Return

  • The number of characters copied (not including the trailing NUL)
  • -E2BIG if count is 0 or src was truncated.
char *stpcpy(char *__restrict__ dest, const char *__restrict__ src)

copy a string from src to dest returning a pointer to the new end of dest, including src’s NUL-terminator. May overrun dest.

Parameters

char *__restrict__ dest
pointer to end of string being copied into. Must be large enough to receive copy.
const char *__restrict__ src
pointer to the beginning of string being copied from. Must not overlap dest.

Description

stpcpy differs from strcpy in a key way: the return value is a pointer to the new NUL-terminating character in dest. (For strcpy, the return value is a pointer to the start of dest). This interface is considered unsafe as it doesn’t perform bounds checking of the inputs. As such it’s not recommended for usage. Instead, its definition is provided in case the compiler lowers other libcalls to stpcpy.

char *strcat(char * dest, const char * src)

Append one NUL-terminated string to another

Parameters

char *dest
The string to be appended to
const char *src
The string to append to it
char *strncat(char * dest, const char * src, size_t count)

Append a length-limited, C-string to another

Parameters

char *dest
The string to be appended to
const char *src
The string to append to it
size_t count
The maximum numbers of bytes to copy

Description

Note that in contrast to strncpy(), strncat() ensures the result is terminated.

size_t strlcat(char * dest, const char * src, size_t count)

Append a length-limited, C-string to another

Parameters

char *dest
The string to be appended to
const char *src
The string to append to it
size_t count
The size of the destination buffer.
int strcmp(const char * cs, const char * ct)

Compare two strings

Parameters

const char *cs
One string
const char *ct
Another string
int strncmp(const char * cs, const char * ct, size_t count)

Compare two length-limited strings

Parameters

const char *cs
One string
const char *ct
Another string
size_t count
The maximum number of bytes to compare
char *strchr(const char * s, int c)

Find the first occurrence of a character in a string

Parameters

const char *s
The string to be searched
int c
The character to search for

Description

Note that the NUL-terminator is considered part of the string, and can be searched for.

char *strchrnul(const char * s, int c)

Find and return a character in a string, or end of string

Parameters

const char *s
The string to be searched
int c
The character to search for

Description

Returns pointer to first occurrence of ‘c’ in s. If c is not found, then return a pointer to the null byte at the end of s.

char *strrchr(const char * s, int c)

Find the last occurrence of a character in a string

Parameters

const char *s
The string to be searched
int c
The character to search for
char *strnchr(const char * s, size_t count, int c)

Find a character in a length limited string

Parameters

const char *s
The string to be searched
size_t count
The number of characters to be searched
int c
The character to search for

Description

Note that the NUL-terminator is considered part of the string, and can be searched for.

char *skip_spaces(const char * str)

Removes leading whitespace from str.

Parameters

const char *str
The string to be stripped.

Description

Returns a pointer to the first non-whitespace character in str.

char *strim(char * s)

Removes leading and trailing whitespace from s.

Parameters

char *s
The string to be stripped.

Description

Note that the first trailing whitespace is replaced with a NUL-terminator in the given string s. Returns a pointer to the first non-whitespace character in s.

size_t strlen(const char * s)

Find the length of a string

Parameters

const char *s
The string to be sized
size_t strnlen(const char * s, size_t count)

Find the length of a length-limited string

Parameters

const char *s
The string to be sized
size_t count
The maximum number of bytes to search
size_t strspn(const char * s, const char * accept)

Calculate the length of the initial substring of s which only contains letters in accept

Parameters

const char *s
The string to be searched
const char *accept
The string to search for
size_t strcspn(const char * s, const char * reject)

Calculate the length of the initial substring of s which does not contain letters in reject

Parameters

const char *s
The string to be searched
const char *reject
The string to avoid
char *strpbrk(const char * cs, const char * ct)

Find the first occurrence of a set of characters

Parameters

const char *cs
The string to be searched
const char *ct
The characters to search for
char *strsep(char ** s, const char * ct)

Split a string into tokens

Parameters

char **s
The string to be searched
const char *ct
The characters to search for

Description

strsep() updates s to point after the token, ready for the next call.

It returns empty tokens, too, behaving exactly like the libc function of that name. In fact, it was stolen from glibc2 and de-fancy-fied. Same semantics, slimmer shape. ;)

bool sysfs_streq(const char * s1, const char * s2)

return true if strings are equal, modulo trailing newline

Parameters

const char *s1
one string
const char *s2
another string

Description

This routine returns true iff two strings are equal, treating both NUL and newline-then-NUL as equivalent string terminations. It’s geared for use with sysfs input strings, which generally terminate with newlines but are compared against values without newlines.

int match_string(const char *const * array, size_t n, const char * string)

matches given string in an array

Parameters

const char *const *array
array of strings
size_t n
number of strings in the array or -1 for NULL terminated arrays
const char *string
string to match with

Description

This routine will look for a string in an array of strings up to the n-th element in the array or until the first NULL element.

Historically the value of -1 for n was used to search in arrays that are NULL terminated. However, the function does not make a distinction when finishing the search: either n elements have been compared OR the first NULL element was found.

Return

index of string in the array if it matches, or -EINVAL otherwise.
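
For example, mapping a mode name to an index (the array contents are hypothetical):

    static const char * const modes[] = { "off", "auto", "manual" };
    int idx;

    idx = match_string(modes, ARRAY_SIZE(modes), "auto");
    /* idx == 1 here; an unknown string would yield -EINVAL */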

int __sysfs_match_string(const char *const * array, size_t n, const char * str)

matches given string in an array

Parameters

const char *const *array
array of strings
size_t n
number of strings in the array or -1 for NULL terminated arrays
const char *str
string to match with

Description

Returns index of str in the array or -EINVAL, just like match_string(). Uses sysfs_streq instead of strcmp for matching.

This routine will look for a string in an array of strings up to the n-th element in the array or until the first NULL element.

Historically the value of -1 for n was used to search in arrays that are NULL terminated. However, the function does not make a distinction when finishing the search: either n elements have been compared OR the first NULL element was found.

void *memset(void * s, int c, size_t count)

Fill a region of memory with the given value

Parameters

void *s
Pointer to the start of the area.
int c
The byte to fill the area with
size_t count
The size of the area.

Description

Do not use memset() to access IO space, use memset_io() instead.

void *memset16(uint16_t * s, uint16_t v, size_t count)

Fill a memory area with a uint16_t

Parameters

uint16_t *s
Pointer to the start of the area.
uint16_t v
The value to fill the area with
size_t count
The number of values to store

Description

Differs from memset() in that it fills with a uint16_t instead of a byte. Remember that count is the number of uint16_ts to store, not the number of bytes.

void *memset32(uint32_t * s, uint32_t v, size_t count)

Fill a memory area with a uint32_t

Parameters

uint32_t *s
Pointer to the start of the area.
uint32_t v
The value to fill the area with
size_t count
The number of values to store

Description

Differs from memset() in that it fills with a uint32_t instead of a byte. Remember that count is the number of uint32_ts to store, not the number of bytes.

void *memset64(uint64_t * s, uint64_t v, size_t count)

Fill a memory area with a uint64_t

Parameters

uint64_t *s
Pointer to the start of the area.
uint64_t v
The value to fill the area with
size_t count
The number of values to store

Description

Differs from memset() in that it fills with a uint64_t instead of a byte. Remember that count is the number of uint64_ts to store, not the number of bytes.

void *memcpy(void * dest, const void * src, size_t count)

Copy one area of memory to another

Parameters

void *dest
Where to copy to
const void *src
Where to copy from
size_t count
The size of the area.

Description

You should not use this function to access IO space, use memcpy_toio() or memcpy_fromio() instead.

void *memmove(void * dest, const void * src, size_t count)

Copy one area of memory to another

Parameters

void *dest
Where to copy to
const void *src
Where to copy from
size_t count
The size of the area.

Description

Unlike memcpy(), memmove() copes with overlapping areas.

__visible int memcmp(const void * cs, const void * ct, size_t count)

Compare two areas of memory

Parameters

const void *cs
One area of memory
const void *ct
Another area of memory
size_t count
The size of the area.
int bcmp(const void * a, const void * b, size_t len)

returns 0 if and only if the buffers have identical contents.

Parameters

const void *a
pointer to first buffer.
const void *b
pointer to second buffer.
size_t len
size of buffers.

Description

The sign or magnitude of a non-zero return value has no particular meaning, and architectures may implement their own more efficient bcmp(). So while this particular implementation is a simple (tail) call to memcmp, do not rely on anything but whether the return value is zero or non-zero.

void *memscan(void * addr, int c, size_t size)

Find a character in an area of memory.

Parameters

void *addr
The memory area
int c
The byte to search for
size_t size
The size of the area.

Description

returns the address of the first occurrence of c, or 1 byte past the area if c is not found

char *strstr(const char * s1, const char * s2)

Find the first substring in a NUL terminated string

Parameters

const char *s1
The string to be searched
const char *s2
The string to search for
char *strnstr(const char * s1, const char * s2, size_t len)

Find the first substring in a length-limited string

Parameters

const char *s1
The string to be searched
const char *s2
The string to search for
size_t len
the maximum number of characters to search
void *memchr(const void * s, int c, size_t n)

Find a character in an area of memory.

Parameters

const void *s
The memory area
int c
The byte to search for
size_t n
The size of the area.

Description

returns the address of the first occurrence of c, or NULL if c is not found

void *memchr_inv(const void * start, int c, size_t bytes)

Find an unmatching character in an area of memory.

Parameters

const void *start
The memory area
int c
Find a character other than c
size_t bytes
The size of the area.

Description

returns the address of the first character other than c, or NULL if the whole buffer contains just c.
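
A common use is verifying that a buffer is entirely zero-filled, e.g. when validating reserved fields (the buffer and length names are hypothetical):

    if (memchr_inv(buf, 0, len))
        return -EINVAL;     /* some byte in 'buf' is non-zero */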

char *strreplace(char * s, char old, char new)

Replace all occurrences of character in string.

Parameters

char *s
The string to operate on.
char old
The character being replaced.
char new
The character old is replaced with.

Description

Returns pointer to the nul byte at the end of s.

sysfs_match_string(_a, _s)

matches given string in an array

Parameters

_a
array of strings
_s
string to match with

Description

Helper for __sysfs_match_string(). Calculates the size of _a automatically.

bool strstarts(const char * str, const char * prefix)

does str start with prefix?

Parameters

const char *str
string to examine
const char *prefix
prefix to look for.
void memzero_explicit(void * s, size_t count)

Fill a region of memory (e.g. sensitive keying data) with 0s.

Parameters

void *s
Pointer to the start of the area.
size_t count
The size of the area.

Note

usually using memset() is just fine (!), but in cases where clearing out _local_ data at the end of a scope is necessary, memzero_explicit() should be used instead in order to prevent the compiler from optimising away zeroing.

Description

memzero_explicit() doesn’t need an arch-specific version as it just invokes the one of memset() implicitly.

const char *kbasename(const char * path)

return the last part of a pathname.

Parameters

const char *path
path to extract the filename from.
void memcpy_and_pad(void * dest, size_t dest_len, const void * src, size_t count, int pad)

Copy one buffer to another with padding

Parameters

void *dest
Where to copy to
size_t dest_len
The destination buffer size
const void *src
Where to copy from
size_t count
The number of bytes to copy
int pad
Character to use for padding if space is left in destination.
size_t str_has_prefix(const char * str, const char * prefix)

Test if a string has a given prefix

Parameters

const char *str
The string to test
const char *prefix
The string to see if str starts with

Description

A common way to test a prefix of a string is to do:
strncmp(str, prefix, sizeof(prefix) - 1)

But this can lead to bugs due to typos, or if prefix is a pointer and not a constant. Instead use str_has_prefix().

Return

  • strlen(prefix) if str starts with prefix
  • 0 if str does not start with prefix
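
The non-zero return value doubles as the prefix length, which allows skipping past the prefix in one step (the parse_value() helper is hypothetical):

    const char *arg = "prefix=value";
    size_t n;

    n = str_has_prefix(arg, "prefix=");
    if (n)
        parse_value(arg + n);   /* 'arg + n' points at "value" */
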
char *kstrdup(const char * s, gfp_t gfp)

allocate space for and copy an existing string

Parameters

const char *s
the string to duplicate
gfp_t gfp
the GFP mask used in the kmalloc() call when allocating memory

Return

newly allocated copy of s or NULL in case of error

const char *kstrdup_const(const char * s, gfp_t gfp)

conditionally duplicate an existing const string

Parameters

const char *s
the string to duplicate
gfp_t gfp
the GFP mask used in the kmalloc() call when allocating memory

Note

Strings allocated by kstrdup_const should be freed by kfree_const.

Return

source string if it is in .rodata section, otherwise fallback to kstrdup.

char *kstrndup(const char * s, size_t max, gfp_t gfp)

allocate space for and copy an existing string

Parameters

const char *s
the string to duplicate
size_t max
read at most max chars from s
gfp_t gfp
the GFP mask used in the kmalloc() call when allocating memory

Note

Use kmemdup_nul() instead if the size is known exactly.

Return

newly allocated copy of s or NULL in case of error

void *kmemdup(const void * src, size_t len, gfp_t gfp)

duplicate region of memory

Parameters

const void *src
memory region to duplicate
size_t len
memory region length
gfp_t gfp
GFP mask to use

Return

newly allocated copy of src or NULL in case of error

char *kmemdup_nul(const char * s, size_t len, gfp_t gfp)

Create a NUL-terminated string from unterminated data

Parameters

const char *s
The data to stringify
size_t len
The size of the data
gfp_t gfp
the GFP mask used in the kmalloc() call when allocating memory

Return

newly allocated copy of s with NUL-termination or NULL in case of error

void *memdup_user(const void __user * src, size_t len)

duplicate memory region from user space

Parameters

const void __user *src
source address in user space
size_t len
number of bytes to copy

Return

an ERR_PTR() on failure. Result is physically contiguous, to be freed by kfree().

void *vmemdup_user(const void __user * src, size_t len)

duplicate memory region from user space

Parameters

const void __user *src
source address in user space
size_t len
number of bytes to copy

Return

an ERR_PTR() on failure. Result may not be physically contiguous. Use kvfree() to free.

char *strndup_user(const char __user * s, long n)

duplicate an existing string from user space

Parameters

const char __user *s
The string to duplicate
long n
Maximum number of bytes to copy, including the trailing NUL.

Return

newly allocated copy of s or an ERR_PTR() in case of error

void *memdup_user_nul(const void __user * src, size_t len)

duplicate memory region from user space and NUL-terminate

Parameters

const void __user *src
source address in user space
size_t len
number of bytes to copy

Return

an ERR_PTR() on failure.

Basic Kernel Library Functions

The Linux kernel provides more basic utility functions.

Bit Operations

void set_bit(long nr, volatile unsigned long * addr)

Atomically set a bit in memory

Parameters

long nr
the bit to set
volatile unsigned long *addr
the address to start counting from

Description

This is a relaxed atomic operation (no implied memory barriers).

Note that nr may be almost arbitrarily large; this function is not restricted to acting on a single-word quantity.

void clear_bit(long nr, volatile unsigned long * addr)

Clears a bit in memory

Parameters

long nr
Bit to clear
volatile unsigned long *addr
Address to start counting from

Description

This is a relaxed atomic operation (no implied memory barriers).

void change_bit(long nr, volatile unsigned long * addr)

Toggle a bit in memory

Parameters

long nr
Bit to change
volatile unsigned long *addr
Address to start counting from

Description

This is a relaxed atomic operation (no implied memory barriers).

Note that nr may be almost arbitrarily large; this function is not restricted to acting on a single-word quantity.

bool test_and_set_bit(long nr, volatile unsigned long * addr)

Set a bit and return its old value

Parameters

long nr
Bit to set
volatile unsigned long *addr
Address to count from

Description

This is an atomic fully-ordered operation (implied full memory barrier).

bool test_and_clear_bit(long nr, volatile unsigned long * addr)

Clear a bit and return its old value

Parameters

long nr
Bit to clear
volatile unsigned long *addr
Address to count from

Description

This is an atomic fully-ordered operation (implied full memory barrier).

bool test_and_change_bit(long nr, volatile unsigned long * addr)

Change a bit and return its old value

Parameters

long nr
Bit to change
volatile unsigned long *addr
Address to count from

Description

This is an atomic fully-ordered operation (implied full memory barrier).

void __set_bit(long nr, volatile unsigned long * addr)

Set a bit in memory

Parameters

long nr
the bit to set
volatile unsigned long *addr
the address to start counting from

Description

Unlike set_bit(), this function is non-atomic. If it is called on the same region of memory concurrently, the effect may be that only one operation succeeds.

void__clear_bit(long nr, volatile unsigned long * addr)

Clears a bit in memory

Parameters

longnr
the bit to clear
volatileunsignedlong*addr
the address to start counting from

Description

Unlike clear_bit(), this function is non-atomic. If it is called on the same region of memory concurrently, the effect may be that only one operation succeeds.

void__change_bit(long nr, volatile unsigned long * addr)

Toggle a bit in memory

Parameters

longnr
the bit to change
volatileunsignedlong*addr
the address to start counting from

Description

Unlike change_bit(), this function is non-atomic. If it is called on the same region of memory concurrently, the effect may be that only one operation succeeds.

bool__test_and_set_bit(long nr, volatile unsigned long * addr)

Set a bit and return its old value

Parameters

longnr
Bit to set
volatileunsignedlong*addr
Address to count from

Description

This operation is non-atomic. If two instances of this operation race, one can appear to succeed but actually fail.

bool__test_and_clear_bit(long nr, volatile unsigned long * addr)

Clear a bit and return its old value

Parameters

longnr
Bit to clear
volatileunsignedlong*addr
Address to count from

Description

This operation is non-atomic. If two instances of this operation race, one can appear to succeed but actually fail.

bool__test_and_change_bit(long nr, volatile unsigned long * addr)

Change a bit and return its old value

Parameters

longnr
Bit to change
volatileunsignedlong*addr
Address to count from

Description

This operation is non-atomic. If two instances of this operation race, one can appear to succeed but actually fail.

booltest_bit(long nr, const volatile unsigned long * addr)

Determine whether a bit is set

Parameters

longnr
bit number to test
constvolatileunsignedlong*addr
Address to start counting from

voidclear_bit_unlock(long nr, volatile unsigned long * addr)

Clear a bit in memory, for unlock

Parameters

longnr
the bit to clear
volatileunsignedlong*addr
the address to start counting from

Description

This operation is atomic and provides release barrier semantics.

void__clear_bit_unlock(long nr, volatile unsigned long * addr)

Clears a bit in memory

Parameters

longnr
Bit to clear
volatileunsignedlong*addr
Address to start counting from

Description

This is a non-atomic operation but implies a release barrier before the memory operation. It can be used for an unlock if no other CPUs can concurrently modify other bits in the word.

booltest_and_set_bit_lock(long nr, volatile unsigned long * addr)

Set a bit and return its old value, for lock

Parameters

longnr
Bit to set
volatileunsignedlong*addr
Address to count from

Description

This operation is atomic and provides acquire barrier semantics if the returned value is 0. It can be used to implement bit locks.
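
A sketch of a simple bit lock built from this pair of operations (the lock word and bit number are illustrative, not from the source):

static unsigned long my_lock_word;          /* illustrative lock word */

while (test_and_set_bit_lock(0, &my_lock_word))
        cpu_relax();                        /* spin until the old value was 0 (acquire) */

/* ... critical section, protected by acquire/release ordering ... */

clear_bit_unlock(0, &my_lock_word);         /* release the bit lock */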

boolclear_bit_unlock_is_negative_byte(long nr, volatile unsigned long * addr)

Clear a bit in memory and test if bottom byte is negative, for unlock.

Parameters

longnr
the bit to clear
volatileunsignedlong*addr
the address to start counting from

Description

This operation is atomic and provides release barrier semantics.

This is a bit of a one-trick pony for the filemap code, which clears PG_locked and tests PG_waiters.

Bitmap Operations

bitmaps provide an array of bits, implemented using an array of unsigned longs. The number of valid bits in a given bitmap does _not_ need to be an exact multiple of BITS_PER_LONG.

The possible unused bits in the last, partially used word of a bitmap are ‘don’t care’. The implementation makes no particular effort to keep them zero. It ensures that their value will not affect the results of any operation. The bitmap operations that return Boolean (bitmap_empty, for example) or scalar (bitmap_weight, for example) results carefully filter out these unused bits from impacting their results.

The byte ordering of bitmaps is more natural on little-endian architectures. See the big-endian headers include/asm-ppc64/bitops.h and include/asm-s390/bitops.h for the best explanations of this ordering.

The DECLARE_BITMAP(name,bits) macro, in linux/types.h, can be used to declare an array named ‘name’ of just enough unsigned longs to contain all bit positions from 0 to ‘bits’ - 1.
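
For illustration, a minimal sketch declaring a bitmap and applying a few of the operations listed below (the 80-bit size is arbitrary):

DECLARE_BITMAP(mask, 80);                   /* enough unsigned longs for bits 0..79 */

bitmap_zero(mask, 80);                      /* clear all 80 bits */
bitmap_set(mask, 4, 8);                     /* set bits 4..11 */
if (!bitmap_empty(mask, 80))
        pr_info("weight = %d\n", bitmap_weight(mask, 80));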

The available bitmap operations and their rough meaning in the case that the bitmap is a single unsigned long are thus:

The generated code is more efficient when nbits is known at compile-time and at most BITS_PER_LONG.

bitmap_zero(dst, nbits)                     *dst = 0UL
bitmap_fill(dst, nbits)                     *dst = ~0UL
bitmap_copy(dst, src, nbits)                *dst = *src
bitmap_and(dst, src1, src2, nbits)          *dst = *src1 & *src2
bitmap_or(dst, src1, src2, nbits)           *dst = *src1 | *src2
bitmap_xor(dst, src1, src2, nbits)          *dst = *src1 ^ *src2
bitmap_andnot(dst, src1, src2, nbits)       *dst = *src1 & ~(*src2)
bitmap_complement(dst, src, nbits)          *dst = ~(*src)
bitmap_equal(src1, src2, nbits)             Are *src1 and *src2 equal?
bitmap_intersects(src1, src2, nbits)        Do *src1 and *src2 overlap?
bitmap_subset(src1, src2, nbits)            Is *src1 a subset of *src2?
bitmap_empty(src, nbits)                    Are all bits zero in *src?
bitmap_full(src, nbits)                     Are all bits set in *src?
bitmap_weight(src, nbits)                   Hamming Weight: number set bits
bitmap_set(dst, pos, nbits)                 Set specified bit area
bitmap_clear(dst, pos, nbits)               Clear specified bit area
bitmap_find_next_zero_area(buf, len, pos, n, mask)  Find bit free area
bitmap_find_next_zero_area_off(buf, len, pos, n, mask, mask_off)  as above
bitmap_next_clear_region(map, start, end, nbits)  Find next clear region
bitmap_next_set_region(map, start, end, nbits)  Find next set region
bitmap_for_each_clear_region(map, rs, re, start, end)  Iterate over all clear regions
bitmap_for_each_set_region(map, rs, re, start, end)  Iterate over all set regions
bitmap_shift_right(dst, src, n, nbits)      *dst = *src >> n
bitmap_shift_left(dst, src, n, nbits)       *dst = *src << n
bitmap_cut(dst, src, first, n, nbits)       Cut n bits from first, copy rest
bitmap_replace(dst, old, new, mask, nbits)  *dst = (*old & ~(*mask)) | (*new & *mask)
bitmap_remap(dst, src, old, new, nbits)     *dst = map(old, new)(src)
bitmap_bitremap(oldbit, old, new, nbits)    newbit = map(old, new)(oldbit)
bitmap_onto(dst, orig, relmap, nbits)       *dst = orig relative to relmap
bitmap_fold(dst, orig, sz, nbits)           dst bits = orig bits mod sz
bitmap_parse(buf, buflen, dst, nbits)       Parse bitmap dst from kernel buf
bitmap_parse_user(ubuf, ulen, dst, nbits)   Parse bitmap dst from user buf
bitmap_parselist(buf, dst, nbits)           Parse bitmap dst from kernel buf
bitmap_parselist_user(buf, dst, nbits)      Parse bitmap dst from user buf
bitmap_find_free_region(bitmap, bits, order)  Find and allocate bit region
bitmap_release_region(bitmap, pos, order)   Free specified bit region
bitmap_allocate_region(bitmap, pos, order)  Allocate specified bit region
bitmap_from_arr32(dst, buf, nbits)          Copy nbits from u32[] buf to dst
bitmap_to_arr32(buf, src, nbits)            Copy nbits from buf to u32[] dst
bitmap_get_value8(map, start)               Get 8bit value from map at start
bitmap_set_value8(map, value, start)        Set 8bit value to map at start

Note, bitmap_zero() and bitmap_fill() operate over the whole region of unsigned longs; that is, bits behind the bitmap, up to the unsigned long boundary, will be zeroed or filled as well. Consider using bitmap_clear() or bitmap_set() to make the zeroing or filling explicit, respectively.

Also the following operations in asm/bitops.h apply to bitmaps:

set_bit(bit, addr)                  *addr |= bit
clear_bit(bit, addr)                *addr &= ~bit
change_bit(bit, addr)               *addr ^= bit
test_bit(bit, addr)                 Is bit set in *addr?
test_and_set_bit(bit, addr)         Set bit and return old value
test_and_clear_bit(bit, addr)       Clear bit and return old value
test_and_change_bit(bit, addr)      Change bit and return old value
find_first_zero_bit(addr, nbits)    Position first zero bit in *addr
find_first_bit(addr, nbits)         Position first set bit in *addr
find_next_zero_bit(addr, nbits, bit)  Position next zero bit in *addr >= bit
find_next_bit(addr, nbits, bit)     Position next set bit in *addr >= bit
find_next_and_bit(addr1, addr2, nbits, bit)  Same as find_next_bit, but in (*addr1 & *addr2)

void__bitmap_shift_right(unsigned long * dst, const unsigned long * src, unsigned shift, unsigned nbits)

logical right shift of the bits in a bitmap

Parameters

unsignedlong*dst
destination bitmap
constunsignedlong*src
source bitmap
unsignedshift
shift by this many bits
unsignednbits
bitmap size, in bits

Description

Shifting right (dividing) means moving bits in the MS -> LS bit direction. Zeros are fed into the vacated MS positions and the LS bits shifted off the bottom are lost.

void__bitmap_shift_left(unsigned long * dst, const unsigned long * src, unsigned int shift, unsigned int nbits)

logical left shift of the bits in a bitmap

Parameters

unsignedlong*dst
destination bitmap
constunsignedlong*src
source bitmap
unsignedintshift
shift by this many bits
unsignedintnbits
bitmap size, in bits

Description

Shifting left (multiplying) means moving bits in the LS -> MS direction. Zeros are fed into the vacated LS bit positions and those MS bits shifted off the top are lost.

voidbitmap_cut(unsigned long * dst, const unsigned long * src, unsigned int first, unsigned int cut, unsigned int nbits)

remove bit region from bitmap and right shift remaining bits

Parameters

unsignedlong*dst
destination bitmap, might overlap with src
constunsignedlong*src
source bitmap
unsignedintfirst
start bit of region to be removed
unsignedintcut
number of bits to remove
unsignedintnbits
bitmap size, in bits

Description

Set the n-th bit of dst iff the n-th bit of src is set and n is less than first, or the m-th bit of src is set for any m such that first <= n < nbits, and m = n + cut.

In pictures, example for a big-endian 32-bit architecture:

The src bitmap is:

31                                   63
|                                    |
10000000 11000001 11110010 00010101  10000000 11000001 01110010 00010101
                |  |              |                                    |
               16  14             0                                   32

if cut is 3, and first is 14, bits 14-16 in src are cut and dst is:

31                                   63
|                                    |
10110000 00011000 00110010 00010101  00010000 00011000 00101110 01000010
                   |              |                                    |
                   14 (bit 17     0                                   32
                       from src)

Note that dst and src might overlap partially or entirely.

This is implemented in the obvious way, with a shift and carry step for each moved bit. Optimisation is left as an exercise for the compiler.

unsigned longbitmap_find_next_zero_area_off(unsigned long * map, unsigned long size, unsigned long start, unsigned int nr, unsigned long align_mask, unsigned long align_offset)

find a contiguous aligned zero area

Parameters

unsignedlong*map
The address to base the search on
unsignedlongsize
The bitmap size in bits
unsignedlongstart
The bitnumber to start searching at
unsignedintnr
The number of zeroed bits we’re looking for
unsignedlongalign_mask
Alignment mask for zero area
unsignedlongalign_offset
Alignment offset for zero area.

Description

The align_mask should be one less than a power of 2; the effect is that the bit offset of all zero areas this function finds plus align_offset is a multiple of that power of 2.

intbitmap_parse_user(const char __user * ubuf, unsigned int ulen, unsigned long * maskp, int nmaskbits)

convert an ASCII hex string in a user buffer into a bitmap

Parameters

constchar__user*ubuf
pointer to user buffer containing string.
unsignedintulen
buffer size in bytes. If the string is smaller than this then it must be terminated with a \0.
unsignedlong*maskp
pointer to bitmap array that will contain result.
intnmaskbits
size of bitmap, in bits.
intbitmap_print_to_pagebuf(bool list, char * buf, const unsigned long * maskp, int nmaskbits)

convert bitmap to list or hex format ASCII string

Parameters

boollist
indicates whether the bitmap should be printed in list format
char*buf
page aligned buffer into which string is placed
constunsignedlong*maskp
pointer to bitmap to convert
intnmaskbits
size of bitmap, in bits

Description

Output format is a comma-separated list of decimal numbers and ranges if list is specified, or hex digits grouped into comma-separated sets of 8 digits/set. Returns the number of characters written to buf.

It is assumed that buf is a pointer into a PAGE_SIZE, page-aligned area and that sufficient storage remains at buf to accommodate the bitmap_print_to_pagebuf() output. Returns the number of characters actually printed to buf, excluding the terminating ‘\0’.

intbitmap_parselist(const char * buf, unsigned long * maskp, int nmaskbits)

convert list format ASCII string to bitmap

Parameters

constchar*buf
read user string from this buffer; must be terminated with a \0 or \n.
unsignedlong*maskp
write resulting mask here
intnmaskbits
number of bits in mask to be written

Description

Input format is a comma-separated list of decimal numbers and ranges. Consecutively set bits are shown as two hyphen-separated decimal numbers, the smallest and largest bit numbers set in the range. Optionally each range can be postfixed to denote that only parts of it should be set. The range will be divided into groups of a specific size. From each group only a defined number of bits will be used. Syntax: range:used_size/group_size

Example

0-1023:2/256 ==> 0,1,256,257,512,513,768,769

Return

0 on success, -errno on invalid input strings. Error values:

  • -EINVAL: wrong region format
  • -EINVAL: invalid character in string
  • -ERANGE: bit number specified too large for mask
  • -EOVERFLOW: integer overflow in the input parameters
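
A brief usage sketch of the list format (the bitmap size and input string are illustrative, not from the source):

DECLARE_BITMAP(cpus, 64);
int err;

err = bitmap_parselist("0-3,8,12-15", cpus, 64);
if (err)
        return err;                         /* one of the error codes listed above */
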
intbitmap_parselist_user(const char __user * ubuf, unsigned int ulen, unsigned long * maskp, int nmaskbits)

Parameters

constchar__user*ubuf
pointer to user buffer containing string.
unsignedintulen
buffer size in bytes. If the string is smaller than this then it must be terminated with a \0.
unsignedlong*maskp
pointer to bitmap array that will contain result.
intnmaskbits
size of bitmap, in bits.

Description

Wrapper for bitmap_parselist(), providing it with a user buffer.

intbitmap_parse(const char * start, unsigned int buflen, unsigned long * maskp, int nmaskbits)

convert an ASCII hex string into a bitmap.

Parameters

constchar*start
pointer to buffer containing string.
unsignedintbuflen
buffer size in bytes. If the string is smaller than this then it must be terminated with a \0 or \n. In that case, UINT_MAX may be provided instead of string length.
unsignedlong*maskp
pointer to bitmap array that will contain result.
intnmaskbits
size of bitmap, in bits.

Description

Commas group hex digits into chunks. Each chunk defines exactly 32 bits of the resultant bitmask. No chunk may specify a value larger than 32 bits (-EOVERFLOW), and if a chunk specifies a smaller value then leading 0-bits are prepended. -EINVAL is returned for illegal characters. Grouping such as “1,,5”, “,44”, “,” or “” is allowed. Leading, embedded and trailing whitespace is accepted.

intbitmap_find_free_region(unsigned long * bitmap, unsigned int bits, int order)

find a contiguous aligned mem region

Parameters

unsignedlong*bitmap
array of unsigned longs corresponding to the bitmap
unsignedintbits
number of bits in the bitmap
intorder
region size (log base 2 of number of bits) to find

Description

Find a region of free (zero) bits in a bitmap of bits bits and allocate them (set them to one). Only consider regions of length a power (order) of two, aligned to that power of two, which makes the search algorithm much faster.

Return the bit offset in bitmap of the allocated region, or -errno on failure.

voidbitmap_release_region(unsigned long * bitmap, unsigned int pos, int order)

release allocated bitmap region

Parameters

unsignedlong*bitmap
array of unsigned longs corresponding to the bitmap
unsignedintpos
beginning of bit region to release
intorder
region size (log base 2 of number of bits) to release

Description

This is the complement to __bitmap_find_free_region() and releases the found region (by clearing it in the bitmap).

No return value.

intbitmap_allocate_region(unsigned long * bitmap, unsigned int pos, int order)

allocate bitmap region

Parameters

unsignedlong*bitmap
array of unsigned longs corresponding to the bitmap
unsignedintpos
beginning of bit region to allocate
intorder
region size (log base 2 of number of bits) to allocate

Description

Allocate (set bits in) a specified region of a bitmap.

Return 0 on success, or -EBUSY if the specified region wasn’t free (not all bits were zero).

voidbitmap_copy_le(unsigned long * dst, const unsigned long * src, unsigned int nbits)

copy a bitmap, putting the bits into little-endian order.

Parameters

unsignedlong*dst
destination buffer
constunsignedlong*src
bitmap to copy
unsignedintnbits
number of bits in the bitmap

Description

Require nbits % BITS_PER_LONG == 0.

voidbitmap_from_arr32(unsigned long * bitmap, const u32 * buf, unsigned int nbits)

copy the contents of u32 array of bits to bitmap

Parameters

unsignedlong*bitmap
array of unsigned longs, the destination bitmap
constu32*buf
array of u32 (in host byte order), the source bitmap
unsignedintnbits
number of bits in bitmap

voidbitmap_to_arr32(u32 * buf, const unsigned long * bitmap, unsigned int nbits)

copy the contents of bitmap to a u32 array of bits

Parameters

u32*buf
array of u32 (in host byte order), the dest bitmap
constunsignedlong*bitmap
array of unsigned longs, the source bitmap
unsignedintnbits
number of bits in bitmap

intbitmap_pos_to_ord(const unsigned long * buf, unsigned int pos, unsigned int nbits)

find ordinal of set bit at given position in bitmap

Parameters

constunsignedlong*buf
pointer to a bitmap
unsignedintpos
a bit position in buf (0 <= pos < nbits)
unsignedintnbits
number of valid bit positions inbuf

Description

Map the bit at position pos in buf (of length nbits) to the ordinal of which set bit it is. If it is not set or if pos is not a valid bit position, map to -1.

If for example, just bits 4 through 7 are set in buf, then pos values 4 through 7 will get mapped to 0 through 3, respectively, and other pos values will get mapped to -1. When pos value 7 gets mapped to (returns) ord value 3 in this example, that means that bit 7 is the 3rd (starting with 0th) set bit in buf.

The bit positions 0 through nbits - 1 are valid positions in buf.

unsigned intbitmap_ord_to_pos(const unsigned long * buf, unsigned int ord, unsigned int nbits)

find position of n-th set bit in bitmap

Parameters

constunsignedlong*buf
pointer to bitmap
unsignedintord
ordinal bit position (n-th set bit, n >= 0)
unsignedintnbits
number of valid bit positions inbuf

Description

Map the ordinal offset of bit ord in buf to its position in buf. Value of ord should be in range 0 <= ord < weight(buf). If ord >= weight(buf), returns nbits.

If for example, just bits 4 through 7 are set in buf, then ord values 0 through 3 will get mapped to 4 through 7, respectively, and all other ord values return nbits. When ord value 3 gets mapped to (returns) pos value 7 in this example, that means that the 3rd set bit (starting with 0th) is at position 7 in buf.

The bit positions 0 through nbits-1 are valid positions in buf.

voidbitmap_remap(unsigned long * dst, const unsigned long * src, const unsigned long * old, const unsigned long * new, unsigned int nbits)

Apply map defined by a pair of bitmaps to another bitmap

Parameters

unsignedlong*dst
remapped result
constunsignedlong*src
subset to be remapped
constunsignedlong*old
defines domain of map
constunsignedlong*new
defines range of map
unsignedintnbits
number of bits in each of these bitmaps

Description

Let old and new define a mapping of bit positions, such that whatever position is held by the n-th set bit in old is mapped to the n-th set bit in new. In the more general case, allowing for the possibility that the weight ‘w’ of new is less than the weight of old, map the position of the n-th set bit in old to the position of the m-th set bit in new, where m == n % w.

If either of the old and new bitmaps are empty, or if src and dst point to the same location, then this routine copies src to dst.

The positions of unset bits in old are mapped to themselves (the identity map).

Apply the above specified mapping to src, placing the result in dst, clearing any bits previously set in dst.

For example, let’s say that old has bits 4 through 7 set, and new has bits 12 through 15 set. This defines the mapping of bit position 4 to 12, 5 to 13, 6 to 14 and 7 to 15, and of all other bit positions unchanged. So if, say, src comes into this routine with bits 1, 5 and 7 set, then dst should leave with bits 1, 13 and 15 set.

intbitmap_bitremap(int oldbit, const unsigned long * old, const unsigned long * new, int bits)

Apply map defined by a pair of bitmaps to a single bit

Parameters

intoldbit
bit position to be mapped
constunsignedlong*old
defines domain of map
constunsignedlong*new
defines range of map
intbits
number of bits in each of these bitmaps

Description

Let old and new define a mapping of bit positions, such that whatever position is held by the n-th set bit in old is mapped to the n-th set bit in new. In the more general case, allowing for the possibility that the weight ‘w’ of new is less than the weight of old, map the position of the n-th set bit in old to the position of the m-th set bit in new, where m == n % w.

The positions of unset bits in old are mapped to themselves (the identity map).

Apply the above specified mapping to bit position oldbit, returning the new bit position.

For example, let’s say that old has bits 4 through 7 set, and new has bits 12 through 15 set. This defines the mapping of bit position 4 to 12, 5 to 13, 6 to 14 and 7 to 15, and of all other bit positions unchanged. So if, say, oldbit is 5, then this routine returns 13.

voidbitmap_onto(unsigned long * dst, const unsigned long * orig, const unsigned long * relmap, unsigned int bits)

translate one bitmap relative to another

Parameters

unsignedlong*dst
resulting translated bitmap
constunsignedlong*orig
original untranslated bitmap
constunsignedlong*relmap
bitmap relative to which translated
unsignedintbits
number of bits in each of these bitmaps

Description

Set the n-th bit of dst iff there exists some m such that the n-th bit of relmap is set, the m-th bit of orig is set, and the n-th bit of relmap is also the m-th _set_ bit of relmap. (If you understood the previous sentence the first time you read it, you’re overqualified for your current job.)

In other words, orig is mapped onto (surjectively) dst, using the map { <n, m> | the n-th bit of relmap is the m-th set bit of relmap }.

Any set bits in orig above bit number W, where W is the weight of (number of set bits in) relmap, are mapped nowhere. In particular, if for all bits m set in orig, m >= W, then dst will end up empty. In situations where the possibility of such an empty result is not desired, one way to avoid it is to use the bitmap_fold() operator, below, to first fold the orig bitmap over itself so that all its set bits x are in the range 0 <= x < W. The bitmap_fold() operator does this by setting the bit (m % W) in dst, for each bit (m) set in orig.

Example [1] for bitmap_onto():

Let’s say relmap has bits 30-39 set, and orig has bits 1, 3, 5, 7, 9 and 11 set. Then on return from this routine, dst will have bits 31, 33, 35, 37 and 39 set.

When bit 0 is set in orig, it means turn on the bit in dst corresponding to whatever is the first bit (if any) that is turned on in relmap. Since bit 0 was off in the above example, we leave off that bit (bit 30) in dst.

When bit 1 is set in orig (as in the above example), it means turn on the bit in dst corresponding to whatever is the second bit that is turned on in relmap. The second bit in relmap that was turned on in the above example was bit 31, so we turned on bit 31 in dst.

Similarly, we turned on bits 33, 35, 37 and 39 in dst, because they were the 4th, 6th, 8th and 10th set bits set in relmap, and the 4th, 6th, 8th and 10th bits of orig (i.e. bits 3, 5, 7 and 9) were also set.

When bit 11 is set in orig, it means turn on the bit in dst corresponding to whatever is the twelfth bit that is turned on in relmap. In the above example, there were only ten bits turned on in relmap (30..39), so the fact that bit 11 was set in orig had no effect on dst.

Example [2] for bitmap_fold() + bitmap_onto():

Let’s say relmap has these ten bits set:

40 41 42 43 45 48 53 61 74 95

(for the curious, that’s 40 plus the first ten terms of the Fibonacci sequence.)

Further let’s say we use the following code, invoking bitmap_fold() then bitmap_onto(), as suggested above to avoid the possibility of an empty dst result:

unsigned long *tmp;     // a temporary bitmap's bits

bitmap_fold(tmp, orig, bitmap_weight(relmap, bits), bits);
bitmap_onto(dst, tmp, relmap, bits);

Then this table shows what various values of dst would be, for various orig’s. I list the zero-based positions of each set bit. The tmp column shows the intermediate result, as computed by using bitmap_fold() to fold the orig bitmap modulo ten (the weight of relmap):

orig            tmp            dst
0               0              40
1               1              41
9               9              95
10              0              40 [1]
1 3 5 7         1 3 5 7        41 43 48 61
0 1 2 3 4       0 1 2 3 4      40 41 42 43 45
0 9 18 27       0 9 8 7        40 61 74 95
0 10 20 30      0              40
0 11 22 33      0 1 2 3        40 41 42 43
0 12 24 36      0 2 4 6        40 42 45 53
78 102 211      1 2 8          41 42 74 [1]

[1] For these marked lines, if we hadn’t first done bitmap_fold() into tmp, then the dst result would have been empty.

If either oforig orrelmap is empty (no set bits), thendstwill be returned empty.

If (as explained above) the only set bits in orig are in positions m where m >= W (where W is the weight of relmap), then dst will once again be returned empty.

All bits indst not set by the above rule are cleared.

voidbitmap_fold(unsigned long * dst, const unsigned long * orig, unsigned int sz, unsigned int nbits)

fold larger bitmap into smaller, modulo specified size

Parameters

unsignedlong*dst
resulting smaller bitmap
constunsignedlong*orig
original larger bitmap
unsignedintsz
specified size
unsignedintnbits
number of bits in each of these bitmaps

Description

For each bit oldbit in orig, set bit oldbit mod sz in dst. Clear all other bits in dst. See further the comment and Example [2] for bitmap_onto() for why and how to use this.

unsigned longbitmap_find_next_zero_area(unsigned long * map, unsigned long size, unsigned long start, unsigned int nr, unsigned long align_mask)

find a contiguous aligned zero area

Parameters

unsignedlong*map
The address to base the search on
unsignedlongsize
The bitmap size in bits
unsignedlongstart
The bitnumber to start searching at
unsignedintnr
The number of zeroed bits we’re looking for
unsignedlongalign_mask
Alignment mask for zero area

Description

The align_mask should be one less than a power of 2; the effect is that the bit offset of all zero areas this function finds is a multiple of that power of 2. An align_mask of 0 means no alignment is required.

boolbitmap_or_equal(const unsigned long * src1, const unsigned long * src2, const unsigned long * src3, unsigned int nbits)

Check whether the or of two bitmaps is equal to a third

Parameters

constunsignedlong*src1
Pointer to bitmap 1
constunsignedlong*src2
Pointer to bitmap 2, which will be OR’ed with bitmap 1
constunsignedlong*src3
Pointer to bitmap 3. Compare to the result of *src1 | *src2
unsignedintnbits
number of bits in each of these bitmaps

Return

True if (*src1 | *src2) == *src3, false otherwise

BITMAP_FROM_U64(n)

Represent u64 value in the format suitable for bitmap.

Parameters

n
u64 value

Description

Linux bitmaps are internally arrays of unsigned longs, i.e. 32-bit integers in a 32-bit environment, and 64-bit integers in a 64-bit one.

There are four combinations of endianness and length of the word in Linux ABIs: LE64, BE64, LE32 and BE32.

On 64-bit kernels 64-bit LE and BE numbers are naturally ordered in bitmaps and therefore don’t require any special handling.

On 32-bit kernels the 32-bit LE ABI orders the lo word of a 64-bit number in memory prior to the hi word, and 32-bit BE orders the hi word prior to the lo word. The bitmap, on the other hand, is represented as an array of 32-bit words, and the position of bit N may therefore be calculated as: word #(N/32) and bit #(N%32) in that word. For example, bit #42 is located at the 10th position of the 2nd word. This matches the 32-bit LE ABI, and we can simply let the compiler store 64-bit values in memory as it usually does. But for BE we need to swap the hi and lo words manually.

With all that, the macro BITMAP_FROM_U64() does explicit reordering of the hi and lo parts of a u64. For LE32 it does nothing, and for a BE environment it swaps the hi and lo words, as is expected by the bitmap.
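
A sketch of the typical use, initializing a bitmap from 64-bit constants in an array initializer (the values are illustrative):

static const unsigned long mask[] = {
        BITMAP_FROM_U64(0x00000000000000ffULL),   /* bits 0..7 */
        BITMAP_FROM_U64(0xff00000000000000ULL),   /* bits 56..63 of the 2nd u64, i.e. 120..127 */
};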

voidbitmap_from_u64(unsigned long * dst, u64 mask)

Check and swap words within u64.

Parameters

unsignedlong*dst
destination bitmap
u64mask
source bitmap

Description

In a 32-bit big-endian kernel, when using ((u32 *)&val)[*] to read a u64 mask, we will get the wrong word. That is, ((u32 *)&val)[0] gets the upper 32 bits, but we expect the lower 32 bits of the u64.

unsigned longbitmap_get_value8(const unsigned long * map, unsigned long start)

get an 8-bit value within a memory region

Parameters

constunsignedlong*map
address to the bitmap memory region
unsignedlongstart
bit offset of the 8-bit value; must be a multiple of 8

Description

Returns the 8-bit value located at the start bit offset within the map memory region.

voidbitmap_set_value8(unsigned long * map, unsigned long value, unsigned long start)

set an 8-bit value within a memory region

Parameters

unsignedlong*map
address to the bitmap memory region
unsignedlongvalue
the 8-bit value; values wider than 8 bits may clobber bitmap
unsignedlongstart
bit offset of the 8-bit value; must be a multiple of 8
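
A short sketch combining the two accessors (map size and value are illustrative):

DECLARE_BITMAP(map, 64);
unsigned long v;

bitmap_zero(map, 64);
bitmap_set_value8(map, 0xa5, 8);            /* store 0xa5 at bits 8..15 */
v = bitmap_get_value8(map, 8);              /* v is now 0xa5 */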

Command-line Parsing

intget_option(char ** str, int * pint)

Parse integer from an option string

Parameters

char**str
option string
int*pint

(output) integer value parsed from str

Read an int from an option string; if available, accept a subsequent comma as well.

Return values:

0 - no int in string
1 - int found, no subsequent comma
2 - int found including a subsequent comma
3 - hyphen found to denote a range
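
As a sketch, a __setup() handler using get_option() (the parameter name and handler are hypothetical):

static int __init my_param_setup(char *str)
{
        int val;

        if (get_option(&str, &val) == 0)    /* 0: no int in string */
                return 0;
        /* str now points past the int, and past one comma if present */
        return 1;
}
__setup("myparam=", my_param_setup);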

char *get_options(const char * str, int nints, int * ints)

Parse a string into a list of integers

Parameters

constchar*str
String to be parsed
intnints
size of integer array
int*ints

integer array

This function parses a string containing a comma-separated list of integers, a hyphen-separated range of _positive_ integers, or a combination of both. The parse halts when the array is full, or when no more numbers can be retrieved from the string.

Return value is the character in the string which caused the parse to end (typically a null terminator, if str is completely parseable).

unsigned long longmemparse(const char * ptr, char ** retptr)

parse a string with mem suffixes into a number

Parameters

constchar*ptr
Where parse begins
char**retptr

(output) Optional pointer to next char after parse completes

Parses a string into a number. The number stored at ptr is potentially suffixed with K, M, G, T, P, E.
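
For example (a minimal sketch), parsing a size string such as "64K":

char *after;
unsigned long long bytes;

bytes = memparse("64K", &after);            /* bytes == 65536 */
                                            /* *after == '\0': whole string consumed */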

Sorting

voidsort_r(void * base, size_t num, size_t size, cmp_r_func_t cmp_func, swap_func_t swap_func, const void * priv)

sort an array of elements

Parameters

void*base
pointer to data to sort
size_tnum
number of elements
size_tsize
size of each element
cmp_r_func_tcmp_func
pointer to comparison function
swap_func_tswap_func
pointer to swap function or NULL
constvoid*priv
third argument passed to comparison function

Description

This function does a heapsort on the given array. You may provide a swap_func function if you need to do something more than a memory copy (e.g. fix up pointers or auxiliary data), but the built-in swap avoids a slow retpoline and so is significantly faster.

Sorting time is O(n log n) both on average and worst-case. While quicksort is slightly faster on average, it suffers from exploitable O(n^2) worst-case behavior and extra memory requirements that make it less suitable for kernel use.
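
A sketch of sorting a plain int array with sort_r() (the comparison helper is hypothetical; passing NULL as swap_func selects the faster built-in swap):

static int cmp_ints(const void *a, const void *b, const void *priv)
{
        int x = *(const int *)a, y = *(const int *)b;

        return (x > y) - (x < y);           /* ascending order */
}

int v[] = { 3, 1, 2 };

sort_r(v, ARRAY_SIZE(v), sizeof(v[0]), cmp_ints, NULL, NULL);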

voidlist_sort(void * priv, struct list_head * head, int (*cmp)(void *priv, struct list_head *a, struct list_head *b))

sort a list

Parameters

void*priv
private data, opaque to list_sort(), passed to cmp
structlist_head*head
the list to sort
int(*)(void*priv,structlist_head*a,structlist_head*b)cmp
the elements comparison function

Description

The comparison function cmp must return > 0 if a should sort after b (“a > b” if you want an ascending sort), and <= 0 if a should sort before b or their original order should be preserved. It is always called with the element that came first in the input in a, and list_sort is a stable sort, so it is not necessary to distinguish the a < b and a == b cases.

This is compatible with two styles of cmp function:

  • The traditional style which returns <0 / =0 / >0, or
  • Returning a boolean 0/1.

The latter offers a chance to save a few cycles in the comparison (which is used by e.g. plug_ctx_cmp() in block/blk-mq.c).

A good way to write a multi-word comparison is:

if (a->high != b->high)
        return a->high > b->high;
if (a->middle != b->middle)
        return a->middle > b->middle;
return a->low > b->low;
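
Putting that together, a sketch of a complete cmp callback and the list_sort() call (struct my_item, its key field, and my_list are hypothetical):

static int my_cmp(void *priv, struct list_head *a, struct list_head *b)
{
        struct my_item *ia = list_entry(a, struct my_item, node);
        struct my_item *ib = list_entry(b, struct my_item, node);

        return ia->key > ib->key;           /* boolean style: nonzero iff a sorts after b */
}

list_sort(NULL, &my_list, my_cmp);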

This mergesort is as eager as possible while always performing at least 2:1 balanced merges. Given two pending sublists of size 2^k, they are merged to a size-2^(k+1) list as soon as we have 2^k following elements.

Thus, it will avoid cache thrashing as long as 3*2^k elements can fit into the cache. Not quite as good as a fully-eager bottom-up mergesort, but it does use 0.2*n fewer comparisons, so is faster in the common case that everything fits into L1.

The merging is controlled by “count”, the number of elements in the pending lists. This is beautifully simple code, but rather subtle.

Each time we increment “count”, we set one bit (bit k) and clear bits k-1 .. 0. Each time this happens (except the very first time for each bit, when count increments to 2^k), we merge two lists of size 2^k into one list of size 2^(k+1).

This merge happens exactly when the count reaches an odd multiple of 2^k, which is when we have 2^k elements pending in smaller lists, so it’s safe to merge away two lists of size 2^k.

After this happens twice, we have created two lists of size 2^(k+1), which will be merged into a list of size 2^(k+2) before we create a third list of size 2^(k+1), so there are never more than two pending.

The number of pending lists of size 2^k is determined by the state of bit k of “count” plus two extra pieces of information:

  • The state of bit k-1 (when k == 0, consider bit -1 always set), and
  • Whether the higher-order bits are zero or non-zero (i.e. is count >= 2^(k+1)).

There are six states we distinguish. “x” represents some arbitrary bits, and “y” represents some arbitrary non-zero bits:

0:  00x: 0 pending of size 2^k; x pending of sizes < 2^k
1:  01x: 0 pending of size 2^k; 2^(k-1) + x pending of sizes < 2^k
2: x10x: 0 pending of size 2^k; 2^k + x pending of sizes < 2^k
3: x11x: 1 pending of size 2^k; 2^(k-1) + x pending of sizes < 2^k
4: y00x: 1 pending of size 2^k; 2^k + x pending of sizes < 2^k
5: y01x: 2 pending of size 2^k; 2^(k-1) + x pending of sizes < 2^k
   (merge and loop back to state 2)

We gain lists of size 2^k in the 2->3 and 4->5 transitions (because bit k-1 is set while the more significant bits are non-zero) and merge them away in the 5->2 transition. Note in particular that just before the 5->2 transition, all lower-order bits are 11 (state 3), so there is one list of each smaller size.

When we reach the end of the input, we merge all the pending lists, from smallest to largest. If you work through cases 2 to 5 above, you can see that the number of elements we merge with a list of size 2^k varies from 2^(k-1) (cases 3 and 5 when x == 0) to 2^(k+1) - 1 (second merge of case 5 when x == 2^(k-1) - 1).

Text Searching

INTRODUCTION

The textsearch infrastructure provides text searching facilities for both linear and non-linear data. Individual search algorithms are implemented in modules and chosen by the user.

ARCHITECTURE

  User
  +----------------+
  |        finish()|<--------------(6)-----------------+
  |get_next_block()|<--------------(5)---------------+ |
  |                |                     Algorithm   | |
  |                |                    +------------------------------+
  |                |                    |  init()   find()   destroy() |
  |                |                    +------------------------------+
  |                |       Core API           ^       ^          ^
  |                |      +---------------+  (2)     (4)        (8)
  |             (1)|----->| prepare()     |---+       |          |
  |             (3)|----->| find()/next() |-----------+          |
  |             (7)|----->| destroy()     |----------------------+
  +----------------+      +---------------+

(1) User configures a search by calling textsearch_prepare() specifying
    the search parameters such as the pattern and algorithm name.
(2) Core requests the algorithm to allocate and initialize a search
    configuration according to the specified parameters.
(3) User starts the search(es) by calling textsearch_find() or
    textsearch_next() to fetch subsequent occurrences. A state variable
    is provided to the algorithm to store persistent variables.
(4) Core eventually resets the search offset and forwards the find()
    request to the algorithm.
(5) Algorithm calls get_next_block() provided by the user continuously
    to fetch the data to be searched in block by block.
(6) Algorithm invokes finish() after the last call to get_next_block
    to clean up any leftovers from get_next_block. (Optional)
(7) User destroys the configuration by calling textsearch_destroy().
(8) Core notifies the algorithm to destroy algorithm specific
    allocations. (Optional)

USAGE

Before a search can be performed, a configuration must be created by calling textsearch_prepare(), specifying the searching algorithm, the pattern to look for and flags. As a flag, you can set TS_IGNORECASE to perform case insensitive matching. But it might slow down the performance of the algorithm, so you should use it at your own risk. The returned configuration may then be used an arbitrary number of times and even in parallel, as long as a separate struct ts_state variable is provided to every instance.

The actual search is performed by either calling textsearch_find_continuous() for linear data or by providing an own get_next_block() implementation and calling textsearch_find(). Both functions return the position of the first occurrence of the pattern or UINT_MAX if no match was found. Subsequent occurrences can be found by calling textsearch_next() regardless of the linearity of the data.

Once you’re done using a configuration it must be given back via textsearch_destroy().

EXAMPLE:

int pos;
struct ts_config *conf;
struct ts_state state;
const char *pattern = "chicken";
const char *example = "We dance the funky chicken";

conf = textsearch_prepare("kmp", pattern, strlen(pattern),
                          GFP_KERNEL, TS_AUTOLOAD);
if (IS_ERR(conf)) {
    err = PTR_ERR(conf);
    goto errout;
}

pos = textsearch_find_continuous(conf, &state, example, strlen(example));
if (pos != UINT_MAX)
    panic("Oh my god, dancing chickens at %d\n", pos);

textsearch_destroy(conf);

inttextsearch_register(struct ts_ops * ops)

register a textsearch module

Parameters

structts_ops*ops
operations lookup table

Description

This function must be called by textsearch modules to announce their presence. The specified ops must have name set to a unique identifier and the callbacks find(), init(), get_pattern(), and get_pattern_len() must be implemented.

Returns 0, or -EEXIST if another module has already registered with the same name.

inttextsearch_unregister(struct ts_ops * ops)

unregister a textsearch module

Parameters

structts_ops*ops
operations lookup table

Description

This function must be called by textsearch modules to announce their disappearance, for example when the module gets unloaded. The ops parameter must be the same as the one used during registration.

Returns 0 on success or -ENOENT if no matching textsearchregistration was found.

unsigned inttextsearch_find_continuous(struct ts_config * conf, struct ts_state * state, const void * data, unsigned int len)

search a pattern in continuous/linear data

Parameters

structts_config*conf
search configuration
structts_state*state
search state
constvoid*data
data to search in
unsignedintlen
length of data

Description

A simplified version of textsearch_find() for continuous/linear data. Call textsearch_next() to retrieve subsequent matches.

Returns the position of the first occurrence of the pattern or UINT_MAX if no occurrence was found.

struct ts_config *textsearch_prepare(const char * algo, const void * pattern, unsigned int len, gfp_t gfp_mask, int flags)

Prepare a search

Parameters

constchar*algo
name of search algorithm
constvoid*pattern
pattern data
unsignedintlen
length of pattern
gfp_tgfp_mask
allocation mask
intflags
search flags

Description

Looks up the search algorithm module and creates a new textsearch configuration for the specified pattern.

Returns a new textsearch configuration according to the specified parameters, or an ERR_PTR(). If a zero length pattern is passed, this function returns ERR_PTR(-EINVAL).

Note

The format of the pattern may not be compatible between the various search algorithms.

voidtextsearch_destroy(struct ts_config * conf)

destroy a search configuration

Parameters

structts_config*conf
search configuration

Description

Releases all references of the configuration and frees up the memory.

unsigned inttextsearch_next(struct ts_config * conf, struct ts_state * state)

continue searching for a pattern

Parameters

structts_config*conf
search configuration
structts_state*state
search state

Description

Continues a search looking for more occurrences of the pattern. textsearch_find() must be called to find the first occurrence in order to reset the state.

Returns the position of the next occurrence of the pattern or UINT_MAX if no match was found.

unsigned inttextsearch_find(struct ts_config * conf, struct ts_state * state)

start searching for a pattern

Parameters

structts_config*conf
search configuration
structts_state*state
search state

Description

Returns the position of the first occurrence of the pattern or UINT_MAX if no match was found.

void *textsearch_get_pattern(struct ts_config * conf)

return head of the pattern

Parameters

structts_config*conf
search configuration

unsigned inttextsearch_get_pattern_len(struct ts_config * conf)

return length of the pattern

Parameters

structts_config*conf
search configuration

CRC and Math Functions in Linux

CRC Functions

uint8_tcrc4(uint8_t c, uint64_t x, int bits)

calculate the 4-bit crc of a value.

Parameters

uint8_tc
starting crc4
uint64_tx
value to checksum
intbits
number of bits inx to checksum

Description

Returns the crc4 value of x, using polynomial 0b10111.

The x value is treated as left-aligned, and bits above bits are ignored in the crc calculations.

u8crc7_be(u8 crc, const u8 * buffer, size_t len)

update the CRC7 for the data buffer

Parameters

u8crc
previous CRC7 value
constu8*buffer
data pointer
size_tlen
number of bytes in the buffer

Context

any

Description

Returns the updated CRC7 value. The CRC7 is left-aligned in the byte (the lsbit is always 0), as that makes the computation easier, and all callers want it in that form.

voidcrc8_populate_msb(u8 table, u8 polynomial)

fill crc table for given polynomial in reverse bit order.

Parameters

u8table
table to be filled.
u8polynomial
polynomial for which table is to be filled.

voidcrc8_populate_lsb(u8 table, u8 polynomial)

fill crc table for given polynomial in regular bit order.

Parameters

u8table
table to be filled.
u8polynomial
polynomial for which table is to be filled.

u8crc8(const u8 table, u8 * pdata, size_t nbytes, u8 crc)

calculate a crc8 over the given input data.

Parameters

constu8table
crc table used for calculation.
u8*pdata
pointer to data buffer.
size_tnbytes
number of bytes in data buffer.
u8crc
previous returned crc8 value.
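
A usage sketch, filling a table and then running it over a buffer (the 0x07 polynomial, data and len are illustrative; CRC8_TABLE_SIZE and CRC8_INIT_VALUE come from linux/crc8.h):

u8 table[CRC8_TABLE_SIZE];
u8 crc;

crc8_populate_msb(table, 0x07);             /* fill table for this polynomial */
crc = crc8(table, data, len, CRC8_INIT_VALUE);
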
u16crc16(u16 crc, u8 const * buffer, size_t len)

compute the CRC-16 for the data buffer

Parameters

u16crc
previous CRC value
u8const*buffer
data pointer
size_tlen
number of bytes in the buffer

Description

Returns the updated CRC value.

u32 __purecrc32_le_generic(u32 crc, unsigned char const * p, size_t len, const u32 (*tab)[256], u32 polynomial)

Calculate bitwise little-endian Ethernet AUTODIN II CRC32/CRC32C

Parameters

u32crc
seed value for computation. ~0 for Ethernet, sometimes 0 for other uses, or the previous crc32/crc32c value if computing incrementally.
unsignedcharconst*p
pointer to buffer over which CRC32/CRC32C is run
size_tlen
length of bufferp
constu32(*tab
little-endian Ethernet table
u32polynomial
CRC32/CRC32c LE polynomial

u32 __attribute_const__crc32_generic_shift(u32 crc, size_t len, u32 polynomial)

Append len 0 bytes to crc, in logarithmic time

Parameters

u32crc
The original little-endian CRC (i.e. lsbit is x^31 coefficient)
size_tlen
The number of bytes. crc is multiplied by x^(8*len)
u32polynomial
The modulus used to reduce the result to 32 bits.

Description

It’s possible to parallelize CRC computations by computing a CRC over separate ranges of a buffer, then summing them. This shifts the given CRC by 8*len bits (i.e. produces the same effect as appending len bytes of zero to the data), in time proportional to log(len).

u32 __purecrc32_be_generic(u32 crc, unsigned char const * p, size_t len, const u32 (*tab)[256], u32 polynomial)

Calculate bitwise big-endian Ethernet AUTODIN II CRC32

Parameters

u32crc
seed value for computation. ~0 for Ethernet, sometimes 0 for other uses, or the previous crc32 value if computing incrementally.
unsignedcharconst*p
pointer to buffer over which CRC32 is run
size_tlen
length of bufferp
constu32(*tab
big-endian Ethernet table
u32polynomial
CRC32 BE polynomial

u16crc_ccitt(u16 crc, u8 const * buffer, size_t len)

recompute the CRC (CRC-CCITT variant) for the data buffer

Parameters

u16crc
previous CRC value
u8const*buffer
data pointer
size_tlen
number of bytes in the buffer

u16crc_ccitt_false(u16 crc, u8 const * buffer, size_t len)

recompute the CRC (CRC-CCITT-FALSE variant) for the data buffer

Parameters

u16crc
previous CRC value
u8const*buffer
data pointer
size_tlen
number of bytes in the buffer

u16crc_itu_t(u16 crc, const u8 * buffer, size_t len)

Compute the CRC-ITU-T for the data buffer

Parameters

u16crc
previous CRC value
constu8*buffer
data pointer
size_tlen
number of bytes in the buffer

Description

Returns the updated CRC value

Base 2 log and power Functions

boolis_power_of_2(unsigned long n)

check if a value is a power of two

Parameters

unsignedlongn
the value to check

Description

Determine whether some value is a power of two, where zero is not considered a power of two.

Return

true ifn is a power of 2, otherwise false.

unsigned long__roundup_pow_of_two(unsigned long n)

round up to nearest power of two

Parameters

unsignedlongn
value to round up

unsigned long__rounddown_pow_of_two(unsigned long n)

round down to nearest power of two

Parameters

unsignedlongn
value to round down

const_ilog2(n)

log base 2 of 32-bit or a 64-bit constant unsigned value

Parameters

n
parameter

Description

Use this where sparse expects a true constant expression, e.g. for array indices.

ilog2(n)

log base 2 of 32-bit or a 64-bit unsigned value

Parameters

n
parameter

Description

constant-capable log of base 2 calculation

  • this can be used to initialise global variables from constant data, hence the massive ternary operator construction
  • selects the appropriately-sized optimised version depending on sizeof(n)
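
For instance (a small sketch; the array size is illustrative), ilog2() lets a constant-sized table be derived from another constant:

#define MY_ENTRIES 256                      /* illustrative constant */
static int my_index_bits[ilog2(MY_ENTRIES)];  /* 8 slots, computed at compile time */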

roundup_pow_of_two(n)

round the given value up to nearest power of two

Parameters

n
parameter

Description

round the given value up to the nearest power of two

  • the result is undefined when n == 0
  • this can be used to initialise global variables from constant data

rounddown_pow_of_two(n)

round the given value down to nearest power of two

Parameters

n
parameter

Description

round the given value down to the nearest power of two

  • the result is undefined when n == 0
  • this can be used to initialise global variables from constant data

order_base_2(n)

calculate the (rounded up) base 2 order of the argument

Parameters

n
parameter

Description

The first few values calculated by this routine:
ob2(0) = 0
ob2(1) = 0
ob2(2) = 1
ob2(3) = 2
ob2(4) = 2
ob2(5) = 3
… and so on.

bits_per(n)

calculate the number of bits required for the argument

Parameters

n
parameter

Description

This is constant-capable and can be used for compile time initializations, e.g. bitfields.

The first few values calculated by this routine:

bf(0) = 1
bf(1) = 1
bf(2) = 2
bf(3) = 2
bf(4) = 3
… and so on.

Integer power Functions

u64int_pow(u64 base, unsigned int exp)

computes the exponentiation of the given base and exponent

Parameters

u64base
base which will be raised to the given power
unsignedintexp
power to be raised to

Description

Computes: pow(base, exp), i.e. base raised to the exp power

unsigned longint_sqrt(unsigned long x)

computes the integer square root

Parameters

unsignedlongx
integer of which to calculate the sqrt

Description

Computes: floor(sqrt(x))

u32int_sqrt64(u64 x)

strongly typed int_sqrt function when minimum 64 bit input is expected.

Parameters

u64x
64bit integer of which to calculate the sqrt

Division Functions

do_div(n,base)

returns 2 values: calculate remainder and update new dividend

Parameters

n
uint64_t dividend (will be updated)
base
uint32_t divisor

Description

Summary: uint32_t remainder = n % base; n = n / base;

Return

(uint32_t)remainder

NOTE

macro parameter n is evaluated multiple times, beware of side effects!
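
A sketch of the canonical pattern; note that n itself becomes the quotient:

u64 ns = 1000000123;
u32 rem;

rem = do_div(ns, NSEC_PER_SEC);             /* ns becomes 1, rem == 123 */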

u64div_u64_rem(u64 dividend, u32 divisor, u32 * remainder)

unsigned 64bit divide with 32bit divisor with remainder

Parameters

u64dividend
unsigned 64bit dividend
u32divisor
unsigned 32bit divisor
u32*remainder
pointer to unsigned 32bit remainder

Return

sets *remainder, then returns dividend / divisor

Description

This is commonly provided by 32bit archs to provide an optimized 64bit divide.

s64div_s64_rem(s64 dividend, s32 divisor, s32 * remainder)

signed 64bit divide with 32bit divisor with remainder

Parameters

s64dividend
signed 64bit dividend
s32divisor
signed 32bit divisor
s32*remainder
pointer to signed 32bit remainder

Return

sets *remainder, then returns dividend / divisor

u64div64_u64_rem(u64 dividend, u64 divisor, u64 * remainder)

unsigned 64bit divide with 64bit divisor and remainder

Parameters

u64dividend
unsigned 64bit dividend
u64divisor
unsigned 64bit divisor
u64*remainder
pointer to unsigned 64bit remainder

Return

sets *remainder, then returns dividend / divisor

u64div64_u64(u64 dividend, u64 divisor)

unsigned 64bit divide with 64bit divisor

Parameters

u64dividend
unsigned 64bit dividend
u64divisor
unsigned 64bit divisor

Return

dividend / divisor

s64div64_s64(s64 dividend, s64 divisor)

signed 64bit divide with 64bit divisor

Parameters

s64dividend
signed 64bit dividend
s64divisor
signed 64bit divisor

Return

dividend / divisor

u64div_u64(u64 dividend, u32 divisor)

unsigned 64bit divide with 32bit divisor

Parameters

u64dividend
unsigned 64bit dividend
u32divisor
unsigned 32bit divisor

Description

This is the most common 64bit divide and should be used if possible, as many 32bit archs can optimize this variant better than a full 64bit divide.

s64div_s64(s64 dividend, s32 divisor)

signed 64bit divide with 32bit divisor

Parameters

s64dividend
signed 64bit dividend
s32divisor
signed 32bit divisor

DIV64_U64_ROUND_CLOSEST(dividend,divisor)

unsigned 64bit divide with 64bit divisor rounded to nearest integer

Parameters

dividend
unsigned 64bit dividend
divisor
unsigned 64bit divisor

Description

Divide unsigned 64bit dividend by unsigned 64bit divisor and round to closest integer.

Return

dividend / divisor rounded to nearest integer

s64div_s64_rem(s64 dividend, s32 divisor, s32 * remainder)

signed 64bit divide with 32bit divisor and remainder

Parameters

s64dividend
64bit dividend
s32divisor
32bit divisor
s32*remainder
32bit remainder

u64div64_u64_rem(u64 dividend, u64 divisor, u64 * remainder)

unsigned 64bit divide with 64bit divisor and remainder

Parameters

u64dividend
64bit dividend
u64divisor
64bit divisor
u64*remainder
64bit remainder

Description

This implementation is comparable to the algorithm used by div64_u64. But this operation, which includes math for calculating the remainder, is kept distinct to avoid slowing down the div64_u64 operation on 32bit systems.

u64div64_u64(u64 dividend, u64 divisor)

unsigned 64bit divide with 64bit divisor

Parameters

u64dividend
64bit dividend
u64divisor
64bit divisor

Description

This implementation is a modified version of the algorithm proposed by the book ‘Hacker’s Delight’. The original source and full proof can be found here and is available for use without restriction.

http://www.hackersdelight.org/hdcodetxt/divDouble.c.txt

s64div64_s64(s64 dividend, s64 divisor)

signed 64bit divide with 64bit divisor

Parameters

s64dividend
64bit dividend
s64divisor
64bit divisor

unsigned longgcd(unsigned long a, unsigned long b)

calculate and return the greatest common divisor of 2 unsigned longs

Parameters

unsignedlonga
first value
unsignedlongb
second value

UUID/GUID

voidgenerate_random_uuid(unsigned char uuid)

generate a random UUID

Parameters

unsignedcharuuid
where to put the generated UUID

Description

Random UUID interface

Used to create a Boot ID or a filesystem UUID/GUID, but can be useful for other kernel drivers.

booluuid_is_valid(const char * uuid)

checks if a UUID string is valid

Parameters

constchar*uuid
UUID string to check

Description

It checks if the UUID string follows the format:
xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

where x is a hex digit.

Return

true if input is valid UUID string.
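
A brief sketch exercising both helpers (the example string is arbitrary):

unsigned char raw[16];

generate_random_uuid(raw);                  /* 16 random bytes in UUID v4 layout */

if (uuid_is_valid("c1d1ff2c-7295-45b2-b16f-9e4e3b4000cc"))
        pr_info("string is a well-formed UUID\n");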

Kernel IPC facilities

IPC utilities

intipc_init(void)

initialise ipc subsystem

Parameters

void
no arguments

Description

The various sysv ipc resources (semaphores, messages and shared memory) are initialised.

A callback routine is registered into the memory hotplug notifier chain: since msgmni scales to lowmem, this callback routine will be called upon successful memory add / remove to recompute msgmni.

voidipc_init_ids(struct ipc_ids * ids)

initialise ipc identifiers

Parameters

structipc_ids*ids
ipc identifier set

Description

Set up the sequence range to use for the ipc identifier range (limited below ipc_mni), then initialise the keys hashtable and ids idr.

voidipc_init_proc_interface(const char * path, const char * header, int ids, int (*show)(struct seq_file *, void *))

create a proc interface for sysipc types using a seq_file interface.

Parameters

constchar*path
Path in procfs
constchar*header
Banner to be printed at the beginning of the file.
intids
ipc id table to iterate.
int(*)(structseq_file*,void*)show
show routine.

struct kern_ipc_perm *ipc_findkey(struct ipc_ids * ids, key_t key)

find a key in an ipc identifier set

Parameters

structipc_ids*ids
ipc identifier set
key_tkey
key to find

Description

Returns the locked pointer to the ipc structure if found, or NULL otherwise. If the key is found, ipc points to the owning ipc structure.

Called with writer ipc_ids.rwsem held.

intipc_addid(struct ipc_ids * ids, struct kern_ipc_perm * new, int limit)

add an ipc identifier

Parameters

structipc_ids*ids
ipc identifier set
structkern_ipc_perm*new
new ipc permission set
intlimit
limit for the number of used ids

Description

Add an entry ‘new’ to the ipc ids idr. The permissions object is initialised, the first free entry is set up, and the index assigned is returned. The ‘new’ entry is returned in a locked state on success.

On failure the entry is not locked and a negative err-code is returned. The caller must use ipc_rcu_putref() to free the identifier.

Called with writer ipc_ids.rwsem held.

intipcget_new(struct ipc_namespace * ns, struct ipc_ids * ids, const struct ipc_ops * ops, struct ipc_params * params)

create a new ipc object

Parameters

structipc_namespace*ns
ipc namespace
structipc_ids*ids
ipc identifier set
conststructipc_ops*ops
the actual creation routine to call
structipc_params*params
its parameters

Description

This routine is called by sys_msgget(), sys_semget() and sys_shmget() when the key is IPC_PRIVATE.

intipc_check_perms(struct ipc_namespace * ns, struct kern_ipc_perm * ipcp, const struct ipc_ops * ops, struct ipc_params * params)

check security and permissions for an ipc object

Parameters

structipc_namespace*ns
ipc namespace
structkern_ipc_perm*ipcp
ipc permission set
conststructipc_ops*ops
the actual security routine to call
structipc_params*params
its parameters

Description

This routine is called by sys_msgget(), sys_semget() and sys_shmget() when the key is not IPC_PRIVATE and that key already exists in the ds IDR.

On success, the ipc id is returned.

It is called with ipc_ids.rwsem and ipcp->lock held.

intipcget_public(struct ipc_namespace * ns, struct ipc_ids * ids, const struct ipc_ops * ops, struct ipc_params * params)

get an ipc object or create a new one

Parameters

structipc_namespace*ns
ipc namespace
structipc_ids*ids
ipc identifier set
conststructipc_ops*ops
the actual creation routine to call
structipc_params*params
its parameters

Description

This routine is called by sys_msgget(), sys_semget() and sys_shmget() when the key is not IPC_PRIVATE. It adds a new entry if the key is not found and does some permission / security checks if the key is found.

On success, the ipc id is returned.

voidipc_kht_remove(struct ipc_ids * ids, struct kern_ipc_perm * ipcp)

remove an ipc from the key hashtable

Parameters

structipc_ids*ids
ipc identifier set
structkern_ipc_perm*ipcp
ipc perm structure containing the key to remove

Description

ipc_ids.rwsem (as a writer) and the spinlock for this ID are held before this function is called, and remain locked on exit.

voidipc_rmid(struct ipc_ids * ids, struct kern_ipc_perm * ipcp)

remove an ipc identifier

Parameters

structipc_ids*ids
ipc identifier set
structkern_ipc_perm*ipcp
ipc perm structure containing the identifier to remove

Description

ipc_ids.rwsem (as a writer) and the spinlock for this ID are held before this function is called, and remain locked on exit.

voidipc_set_key_private(struct ipc_ids * ids, struct kern_ipc_perm * ipcp)

switch the key of an existing ipc to IPC_PRIVATE

Parameters

structipc_ids*ids
ipc identifier set
structkern_ipc_perm*ipcp
ipc perm structure containing the key to modify

Description

ipc_ids.rwsem (as a writer) and the spinlock for this ID are held before this function is called, and remain locked on exit.

intipcperms(struct ipc_namespace * ns, struct kern_ipc_perm * ipcp, short flag)

check ipc permissions

Parameters

structipc_namespace*ns
ipc namespace
structkern_ipc_perm*ipcp
ipc permission set
shortflag
desired permission set

Description

Check user, group, other permissions for access to ipc resources. Return 0 if allowed.

flag will most probably be 0 or S_...UGO from <linux/stat.h>

voidkernel_to_ipc64_perm(struct kern_ipc_perm * in, struct ipc64_perm * out)

convert kernel ipc permissions to user

Parameters

structkern_ipc_perm*in
kernel permissions
structipc64_perm*out
new style ipc permissions

Description

Turn the kernel object in into a set of permissions descriptions for returning to userspace (out).

voidipc64_perm_to_ipc_perm(struct ipc64_perm * in, struct ipc_perm * out)

convert new ipc permissions to old

Parameters

structipc64_perm*in
new style ipc permissions
structipc_perm*out
old style ipc permissions

Description

Turn the new style permissions object in into a compatibility object and store it into the out pointer.

struct kern_ipc_perm *ipc_obtain_object_idr(struct ipc_ids * ids, int id)

Parameters

structipc_ids*ids
ipc identifier set
intid
ipc id to look for

Description

Look for an id in the ipc ids idr and return associated ipc object.

Call inside the RCU critical section. The ipc object is not locked on exit.

struct kern_ipc_perm *ipc_obtain_object_check(struct ipc_ids * ids, int id)

Parameters

structipc_ids*ids
ipc identifier set
intid
ipc id to look for

Description

Similar to ipc_obtain_object_idr() but also checks the ipc object sequence number.

Call inside the RCU critical section. The ipc object is not locked on exit.

intipcget(struct ipc_namespace * ns, struct ipc_ids * ids, const struct ipc_ops * ops, struct ipc_params * params)

Common sys_*get() code

Parameters

structipc_namespace*ns
namespace
structipc_ids*ids
ipc identifier set
conststructipc_ops*ops
operations to be called on ipc object creation, permission checks and further checks
structipc_params*params
the parameters needed by the previous operations.

Description

Common routine called by sys_msgget(), sys_semget() and sys_shmget().

intipc_update_perm(struct ipc64_perm * in, struct kern_ipc_perm * out)

update the permissions of an ipc object

Parameters

structipc64_perm*in
the permission given as input.
structkern_ipc_perm*out
the permission of the ipc to set.
struct kern_ipc_perm *ipcctl_obtain_check(struct ipc_namespace * ns, struct ipc_ids * ids, int id, int cmd, struct ipc64_perm * perm, int extra_perm)

retrieve an ipc object and check permissions

Parameters

structipc_namespace*ns
ipc namespace
structipc_ids*ids
the table of ids where to look for the ipc
intid
the id of the ipc to retrieve
intcmd
the cmd to check
structipc64_perm*perm
the permission to set
intextra_perm
one extra permission parameter used by msq

Description

This function does some common audit and permissions check for some IPC_XXX cmd and is called from semctl_down, shmctl_down and msgctl_down.

It:
  • retrieves the ipc object with the given id in the given table.
  • performs some audit and permission check, depending on the given cmd
  • returns a pointer to the ipc object or otherwise, the corresponding error.

Call holding both the rwsem and the rcu read lock.

intipc_parse_version(int * cmd)

ipc call version

Parameters

int*cmd
pointer to command

Description

Return IPC_64 for new style IPC and IPC_OLD for old style IPC. The cmd value is turned from an encoding of command and version into just the command code.

FIFO Buffer

kfifo interface

DECLARE_KFIFO_PTR(fifo,type)

macro to declare a fifo pointer object

Parameters

fifo
name of the declared fifo
type
type of the fifo elements
DECLARE_KFIFO(fifo,type,size)

macro to declare a fifo object

Parameters

fifo
name of the declared fifo
type
type of the fifo elements
size
the number of elements in the fifo, this must be a power of 2
INIT_KFIFO(fifo)

Initialize a fifo declared by DECLARE_KFIFO

Parameters

fifo
name of the declared fifo datatype
DEFINE_KFIFO(fifo,type,size)

macro to define and initialize a fifo

Parameters

fifo
name of the declared fifo datatype
type
type of the fifo elements
size
the number of elements in the fifo, this must be a power of 2

Note

the macro can be used for global and local fifo data type variables.

kfifo_initialized(fifo)

Check if the fifo is initialized

Parameters

fifo
address of the fifo to check

Description

Return true if fifo is initialized, otherwise false. Assumes the fifo was 0 before.

kfifo_esize(fifo)

returns the size of the element managed by the fifo

Parameters

fifo
address of the fifo to be used
kfifo_recsize(fifo)

returns the size of the record length field

Parameters

fifo
address of the fifo to be used
kfifo_size(fifo)

returns the size of the fifo in elements

Parameters

fifo
address of the fifo to be used
kfifo_reset(fifo)

removes the entire fifo content

Parameters

fifo
address of the fifo to be used

Note

usage of kfifo_reset() is dangerous. It should only be called when the fifo is exclusively locked or when it is certain that no other thread is accessing the fifo.

kfifo_reset_out(fifo)

skip fifo content

Parameters

fifo
address of the fifo to be used

Note

The usage of kfifo_reset_out() is safe as long as it is only called from the reader thread and there is only one concurrent reader. Otherwise it is dangerous and must be handled in the same way as kfifo_reset().

kfifo_len(fifo)

returns the number of used elements in the fifo

Parameters

fifo
address of the fifo to be used
kfifo_is_empty(fifo)

returns true if the fifo is empty

Parameters

fifo
address of the fifo to be used
kfifo_is_empty_spinlocked(fifo,lock)

returns true if the fifo is empty using a spinlock for locking

Parameters

fifo
address of the fifo to be used
lock
spinlock to be used for locking
kfifo_is_empty_spinlocked_noirqsave(fifo,lock)

returns true if the fifo is empty using a spinlock for locking, doesn’t disable interrupts

Parameters

fifo
address of the fifo to be used
lock
spinlock to be used for locking
kfifo_is_full(fifo)

returns true if the fifo is full

Parameters

fifo
address of the fifo to be used
kfifo_avail(fifo)

returns the number of unused elements in the fifo

Parameters

fifo
address of the fifo to be used
kfifo_skip(fifo)

skip output data

Parameters

fifo
address of the fifo to be used
kfifo_peek_len(fifo)

gets the size of the next fifo record

Parameters

fifo
address of the fifo to be used

Description

This function returns the size of the next fifo record in number of bytes.

kfifo_alloc(fifo,size,gfp_mask)

dynamically allocates a new fifo buffer

Parameters

fifo
pointer to the fifo
size
the number of elements in the fifo, this must be a power of 2
gfp_mask
get_free_pages mask, passed tokmalloc()

Description

This macro dynamically allocates a new fifo buffer.

The number of elements will be rounded up to a power of 2. The fifo will be released with kfifo_free(). Return 0 if no error, otherwise an error code.
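
A minimal sketch of the allocate/use/free cycle, assuming a fifo of bytes; the fifo name is illustrative:

    #include <linux/kfifo.h>
    #include <linux/gfp.h>

    static DECLARE_KFIFO_PTR(dyn_fifo, unsigned char);

    static int fifo_setup(void)
    {
            /* room for 128 elements; rounded up to a power of 2 if needed */
            int ret = kfifo_alloc(&dyn_fifo, 128, GFP_KERNEL);

            if (ret)
                    return ret;
            /* ... use the fifo ... */
            kfifo_free(&dyn_fifo);
            return 0;
    }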

kfifo_free(fifo)

frees the fifo

Parameters

fifo
the fifo to be freed
kfifo_init(fifo,buffer,size)

initialize a fifo using a preallocated buffer

Parameters

fifo
the fifo to assign the buffer
buffer
the preallocated buffer to be used
size
the size of the internal buffer, this has to be a power of 2

Description

This macro initializes a fifo using a preallocated buffer.

The number of elements will be rounded up to a power of 2. Return 0 if no error, otherwise an error code.

kfifo_put(fifo,val)

put data into the fifo

Parameters

fifo
address of the fifo to be used
val
the data to be added

Description

This macro copies the given value into the fifo. It returns 0 if the fifo was full. Otherwise it returns the number of processed elements.

Note that with only one concurrent reader and one concurrent writer, you don’t need extra locking to use these macros.

kfifo_get(fifo,val)

get data from the fifo

Parameters

fifo
address of the fifo to be used
val
address where to store the data

Description

This macro reads the data from the fifo. It returns 0 if the fifo was empty. Otherwise it returns the number of processed elements.

Note that with only one concurrent reader and one concurrent writer, you don’t need extra locking to use these macros.
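
A small sketch of the single-producer/single-consumer pattern these macros support, using a statically defined fifo; the fifo name and values are illustrative:

    #include <linux/kfifo.h>
    #include <linux/printk.h>

    /* a fifo holding 16 ints, defined and initialized at compile time */
    static DEFINE_KFIFO(test_fifo, int, 16);

    static void fifo_demo(void)
    {
            int val;

            if (!kfifo_put(&test_fifo, 42))     /* 0 means the fifo was full */
                    return;

            if (kfifo_get(&test_fifo, &val))    /* 0 means the fifo was empty */
                    pr_info("got %d\n", val);
    }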

kfifo_peek(fifo,val)

get data from the fifo without removing

Parameters

fifo
address of the fifo to be used
val
address where to store the data

Description

This reads the data from the fifo without removing it from the fifo. It returns 0 if the fifo was empty. Otherwise it returns the number of processed elements.

Note that with only one concurrent reader and one concurrent writer, you don’t need extra locking to use these macros.

kfifo_in(fifo,buf,n)

put data into the fifo

Parameters

fifo
address of the fifo to be used
buf
the data to be added
n
number of elements to be added

Description

This macro copies the given buffer into the fifo and returns the number of copied elements.

Note that with only one concurrent reader and one concurrent writer, you don’t need extra locking to use these macros.

kfifo_in_spinlocked(fifo,buf,n,lock)

put data into the fifo using a spinlock for locking

Parameters

fifo
address of the fifo to be used
buf
the data to be added
n
number of elements to be added
lock
pointer to the spinlock to use for locking

Description

This macro copies the given values buffer into the fifo and returns the number of copied elements.

kfifo_in_spinlocked_noirqsave(fifo,buf,n,lock)

put data into fifo using a spinlock for locking, don’t disable interrupts

Parameters

fifo
address of the fifo to be used
buf
the data to be added
n
number of elements to be added
lock
pointer to the spinlock to use for locking

Description

This is a variant of kfifo_in_spinlocked() but uses spin_lock/unlock() for locking and doesn’t disable interrupts.

kfifo_out(fifo,buf,n)

get data from the fifo

Parameters

fifo
address of the fifo to be used
buf
pointer to the storage buffer
n
max. number of elements to get

Description

This macro gets some data from the fifo and returns the number of elements copied.

Note that with only one concurrent reader and one concurrent writer, you don’t need extra locking to use these macros.

kfifo_out_spinlocked(fifo,buf,n,lock)

get data from the fifo using a spinlock for locking

Parameters

fifo
address of the fifo to be used
buf
pointer to the storage buffer
n
max. number of elements to get
lock
pointer to the spinlock to use for locking

Description

This macro gets the data from the fifo and returns the number of elements copied.

kfifo_out_spinlocked_noirqsave(fifo,buf,n,lock)

get data from the fifo using a spinlock for locking, don’t disable interrupts

Parameters

fifo
address of the fifo to be used
buf
pointer to the storage buffer
n
max. number of elements to get
lock
pointer to the spinlock to use for locking

Description

This is a variant of kfifo_out_spinlocked() which uses spin_lock/unlock() for locking and doesn’t disable interrupts.

kfifo_from_user(fifo,from,len,copied)

puts some data from user space into the fifo

Parameters

fifo
address of the fifo to be used
from
pointer to the data to be added
len
the length of the data to be added
copied
pointer to output variable to store the number of copied bytes

Description

This macro copies at most len bytes from the from buffer into the fifo, depending on the available space, and returns -EFAULT/0.

Note that with only one concurrent reader and one concurrent writer, you don’t need extra locking to use these macros.

kfifo_to_user(fifo,to,len,copied)

copies data from the fifo into user space

Parameters

fifo
address of the fifo to be used
to
where the data must be copied
len
the size of the destination buffer
copied
pointer to output variable to store the number of copied bytes

Description

This macro copies at most len bytes from the fifo into the to buffer and returns -EFAULT/0.

Note that with only one concurrent reader and one concurrent writer, you don’t need extra locking to use these macros.
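
A sketch of a character-device read() method built on kfifo_to_user(), assuming the dyn_fifo from the earlier allocation example; error handling follows the usual pattern:

    #include <linux/fs.h>
    #include <linux/kfifo.h>

    static ssize_t demo_read(struct file *file, char __user *buf,
                             size_t count, loff_t *ppos)
    {
            unsigned int copied;
            int ret;

            /* copy at most count bytes from the fifo into the user buffer */
            ret = kfifo_to_user(&dyn_fifo, buf, count, &copied);

            return ret ? ret : copied;
    }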

kfifo_dma_in_prepare(fifo,sgl,nents,len)

setup a scatterlist for DMA input

Parameters

fifo
address of the fifo to be used
sgl
pointer to the scatterlist array
nents
number of entries in the scatterlist array
len
number of elements to transfer

Description

This macro fills a scatterlist for DMA input. It returns the number of entries in the scatterlist array.

Note that with only one concurrent reader and one concurrent writer, you don’t need extra locking to use these macros.

kfifo_dma_in_finish(fifo,len)

finish a DMA IN operation

Parameters

fifo
address of the fifo to be used
len
number of bytes received

Description

This macro finishes a DMA IN operation. The in counter will be updated by the len parameter. No error checking will be done.

Note that with only one concurrent reader and one concurrent writer, you don’t need extra locking to use these macros.

kfifo_dma_out_prepare(fifo,sgl,nents,len)

setup a scatterlist for DMA output

Parameters

fifo
address of the fifo to be used
sgl
pointer to the scatterlist array
nents
number of entries in the scatterlist array
len
number of elements to transfer

Description

This macro fills a scatterlist for DMA output of at most len bytes to transfer. It returns the number of entries in the scatterlist array. A zero means there is no space available and the scatterlist is not filled.

Note that with only one concurrent reader and one concurrent writer, you don’t need extra locking to use these macros.

kfifo_dma_out_finish(fifo,len)

finish a DMA OUT operation

Parameters

fifo
address of the fifo to be used
len
number of bytes transferred

Description

This macro finishes a DMA OUT operation. The out counter will be updated by the len parameter. No error checking will be done.

Note that with only one concurrent reader and one concurrent writer, you don’t need extra locking to use these macros.

kfifo_out_peek(fifo,buf,n)

gets some data from the fifo

Parameters

fifo
address of the fifo to be used
buf
pointer to the storage buffer
n
max. number of elements to get

Description

This macro gets the data from the fifo and returns the number of elements copied. The data is not removed from the fifo.

Note that with only one concurrent reader and one concurrent writer, you don’t need extra locking to use these macros.

relay interface support

Relay interface support is designed to provide an efficient mechanism for tools and facilities to relay large amounts of data from kernel space to user space.

relay interface

intrelay_buf_full(struct rchan_buf * buf)

boolean, is the channel buffer full?

Parameters

structrchan_buf*buf

channel buffer

Returns 1 if the buffer is full, 0 otherwise.

voidrelay_reset(struct rchan * chan)

reset the channel

Parameters

structrchan*chan

the channel

This has the effect of erasing all data from all channel buffers and restarting the channel in its initial state. The buffers are not freed, so any mappings are still in effect.

NOTE. Care should be taken that the channel isn’t actually being used by anything when this call is made.

struct rchan *relay_open(const char * base_filename, struct dentry * parent, size_t subbuf_size, size_t n_subbufs, struct rchan_callbacks * cb, void * private_data)

create a new relay channel

Parameters

constchar*base_filename
base name of files to create,NULL for buffering only
structdentry*parent
dentry of parent directory,NULL for root directory or buffer
size_tsubbuf_size
size of sub-buffers
size_tn_subbufs
number of sub-buffers
structrchan_callbacks*cb
client callback functions
void*private_data

user-defined data

Returns channel pointer if successful, NULL otherwise.

Creates a channel buffer for each cpu using the sizes and attributes specified. The created channel buffer files will be named base_filename0…base_filenameN-1. File permissions will be S_IRUSR.

If opening a buffer (parent = NULL) that you later wish to register in a filesystem, call relay_late_setup_files() once the parent dentry is available.
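
A hedged sketch of opening a per-cpu relay channel; the sizes are illustrative, and a real client normally supplies at least create_buf_file()/remove_buf_file() callbacks so the buffer files appear in a filesystem:

    #include <linux/relay.h>
    #include <linux/errno.h>

    static struct rchan_callbacks demo_cb;     /* callbacks left default here */
    static struct rchan *demo_chan;

    static int demo_relay_init(struct dentry *dir)
    {
            /* 8 sub-buffers of 16 KiB each, created per cpu */
            demo_chan = relay_open("demo", dir, 16 * 1024, 8,
                                   &demo_cb, NULL);
            if (!demo_chan)
                    return -ENOMEM;
            /* data is then logged with relay_write() and friends */
            return 0;
    }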

intrelay_late_setup_files(struct rchan * chan, const char * base_filename, struct dentry * parent)

triggers file creation

Parameters

structrchan*chan
channel to operate on
constchar*base_filename
base name of files to create
structdentry*parent

dentry of parent directory,NULL for root directory

Returns 0 if successful, non-zero otherwise.

Use to set up files for a previously buffer-only channel created by relay_open() with a NULL parent dentry.

For example, this is useful for performing early tracing in the kernel, before VFS is up, and then exposing the early results once the dentry is available.

size_trelay_switch_subbuf(struct rchan_buf * buf, size_t length)

switch to a new sub-buffer

Parameters

structrchan_buf*buf
channel buffer
size_tlength

size of current event

Returns either the length passed in or 0 if full.

Performs sub-buffer-switch tasks such as invoking callbacks, updating padding counts, waking up readers, etc.

voidrelay_subbufs_consumed(struct rchan * chan, unsigned int cpu, size_t subbufs_consumed)

update the buffer’s sub-buffers-consumed count

Parameters

structrchan*chan
the channel
unsignedintcpu
the cpu associated with the channel buffer to update
size_tsubbufs_consumed

number of sub-buffers to add to current buf’s count

Adds to the channel buffer’s consumed sub-buffer count. subbufs_consumed should be the number of sub-buffers newly consumed, not the total consumed.

NOTE. Kernel clients don’t need to call this function if the channel mode is ‘overwrite’.

voidrelay_close(struct rchan * chan)

close the channel

Parameters

structrchan*chan

the channel

Closes all channel buffers and frees the channel.

voidrelay_flush(struct rchan * chan)

close the channel

Parameters

structrchan*chan

the channel

Flushes all channel buffers, i.e. forces buffer switch.

intrelay_mmap_buf(struct rchan_buf * buf, struct vm_area_struct * vma)

mmap channel buffer to process address space

Parameters

structrchan_buf*buf
relay channel buffer
structvm_area_struct*vma

vm_area_struct describing memory to be mapped

Returns 0 if ok, negative on error

Caller should already have grabbed mmap_lock.

void *relay_alloc_buf(struct rchan_buf * buf, size_t * size)

allocate a channel buffer

Parameters

structrchan_buf*buf
the buffer struct
size_t*size

total size of the buffer

Returns a pointer to the resulting buffer, NULL if unsuccessful. The passed in size will get page aligned, if it isn’t already.

struct rchan_buf *relay_create_buf(struct rchan * chan)

allocate and initialize a channel buffer

Parameters

structrchan*chan

the relay channel

Returns channel buffer if successful, NULL otherwise.

voidrelay_destroy_channel(struct kref * kref)

free the channel struct

Parameters

structkref*kref

target kernel reference that contains the relay channel

Should only be called from kref_put().

voidrelay_destroy_buf(struct rchan_buf * buf)

destroy an rchan_buf struct and associated buffer

Parameters

structrchan_buf*buf
the buffer struct
voidrelay_remove_buf(struct kref * kref)

remove a channel buffer

Parameters

structkref*kref

target kernel reference that contains the relay buffer

Removes the file from the filesystem, which also frees the rchan_buf struct and the channel buffer. Should only be called from kref_put().

intrelay_buf_empty(struct rchan_buf * buf)

boolean, is the channel buffer empty?

Parameters

structrchan_buf*buf

channel buffer

Returns 1 if the buffer is empty, 0 otherwise.

voidwakeup_readers(struct irq_work * work)

wake up readers waiting on a channel

Parameters

structirq_work*work

contains the channel buffer

This is the function used to defer reader waking.

void__relay_reset(struct rchan_buf * buf, unsigned int init)

reset a channel buffer

Parameters

structrchan_buf*buf
the channel buffer
unsignedintinit

1 if this is a first-time initialization

See relay_reset() for a description of the effect.

voidrelay_close_buf(struct rchan_buf * buf)

close a channel buffer

Parameters

structrchan_buf*buf

channel buffer

Marks the buffer finalized and restores the default callbacks. The channel buffer and channel buffer data structure are then freed automatically when the last reference is given up.

intrelay_file_open(struct inode * inode, struct file * filp)

open file op for relay files

Parameters

structinode*inode
the inode
structfile*filp

the file

Increments the channel buffer refcount.

intrelay_file_mmap(struct file * filp, struct vm_area_struct * vma)

mmap file op for relay files

Parameters

structfile*filp
the file
structvm_area_struct*vma

the vma describing what to map

Calls upon relay_mmap_buf() to map the file into user space.

__poll_trelay_file_poll(struct file * filp, poll_table * wait)

poll file op for relay files

Parameters

structfile*filp
the file
poll_table*wait

poll table

Poll implementation.

intrelay_file_release(struct inode * inode, struct file * filp)

release file op for relay files

Parameters

structinode*inode
the inode
structfile*filp

the file

Decrements the channel refcount, as the filesystem is no longer using it.

size_trelay_file_read_subbuf_avail(size_t read_pos, struct rchan_buf * buf)

return bytes available in sub-buffer

Parameters

size_tread_pos
file read position
structrchan_buf*buf
relay channel buffer
size_trelay_file_read_start_pos(struct rchan_buf * buf)

find the first available byte to read

Parameters

structrchan_buf*buf

relay channel buffer

If the read_pos is in the middle of padding, return the position of the first actually available byte, otherwise return the original value.

size_trelay_file_read_end_pos(struct rchan_buf * buf, size_t read_pos, size_t count)

return the new read position

Parameters

structrchan_buf*buf
relay channel buffer
size_tread_pos
file read position
size_tcount
number of bytes to be read

Module Support

Module Loading

int__request_module(bool wait, const char * fmt, ...)

try to load a kernel module

Parameters

boolwait
wait (or not) for the operation to complete
constchar*fmt
printf style format string for the name of the module
...
arguments as specified in the format string

Description

Load a module using the user mode module loader. The function returns zero on success or a negative errno code or positive exit code from “modprobe” on failure. Note that a successful module load does not mean the module did not then unload and exit on an error of its own. Callers must check that the service they requested is now available, not blindly invoke it.

If module auto-loading support is disabled then this function simply returns -ENOENT.
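
A minimal sketch using the request_module() convenience wrapper (which calls __request_module(true, ...)); the "fs-" alias prefix is the convention used for filesystem modules:

    #include <linux/kmod.h>

    static int demo_load_fs(const char *fstype)
    {
            int ret = request_module("fs-%s", fstype);

            if (ret)
                    return ret;
            /* success does not guarantee the module is still loaded;
             * re-check that the requested service is actually available */
            return 0;
    }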

Inter Module support

Refer to the file kernel/module.c for more information.

Hardware Interfaces

Interrupt Handling

boolsynchronize_hardirq(unsigned int irq)

wait for pending hard IRQ handlers (on other CPUs)

Parameters

unsignedintirq

interrupt number to wait for

This function waits for any pending hard IRQ handlers for this interrupt to complete before returning. If you use this function while holding a resource the IRQ handler may need, you will deadlock. It does not take associated threaded handlers into account.

Do not use this for shutdown scenarios where you must be sure that all parts (hardirq and threaded handler) have completed.

Return

false if a threaded handler is active.

This function may be called - with care - from IRQ context.

It does not check whether there is an interrupt in flight at the hardware level, but not serviced yet, as this might deadlock when called with interrupts disabled and the target CPU of the interrupt is the current CPU.

voidsynchronize_irq(unsigned int irq)

wait for pending IRQ handlers (on other CPUs)

Parameters

unsignedintirq

interrupt number to wait for

This function waits for any pending IRQ handlers for this interrupt to complete before returning. If you use this function while holding a resource the IRQ handler may need, you will deadlock.

Can only be called from preemptible code as it might sleep when an interrupt thread is associated to irq.

It optionally makes sure (when the irq chip supports that method) that the interrupt is not pending in any CPU and waiting for service.

intirq_set_affinity_notifier(unsigned int irq, structirq_affinity_notify * notify)

control notification of IRQ affinity changes

Parameters

unsignedintirq
Interrupt for which to enable/disable notification
structirq_affinity_notify*notify

Context for notification, or NULL to disable notification. Function pointers must be initialised; the other fields will be initialised by this function.

Must be called in process context. Notification may only be enabled after the IRQ is allocated and must be disabled before the IRQ is freed using free_irq().
intirq_set_vcpu_affinity(unsigned int irq, void * vcpu_info)

Set vcpu affinity for the interrupt

Parameters

unsignedintirq
interrupt number to set affinity
void*vcpu_info

vCPU specific data or pointer to a percpu array of vCPU specific data for percpu_devid interrupts

This function uses the vCPU specific data to set the vCPU affinity for an irq. The vCPU specific data is passed from outside, such as KVM. One example code path is as below: KVM -> IOMMU -> irq_set_vcpu_affinity().
voiddisable_irq_nosync(unsigned int irq)

disable an irq without waiting

Parameters

unsignedintirq

Interrupt to disable

Disable the selected interrupt line. Disables and Enables are nested. Unlike disable_irq(), this function does not ensure existing instances of the IRQ handler have completed before returning.

This function may be called from IRQ context.

voiddisable_irq(unsigned int irq)

disable an irq and wait for completion

Parameters

unsignedintirq

Interrupt to disable

Disable the selected interrupt line. Enables and Disables are nested. This function waits for any pending IRQ handlers for this interrupt to complete before returning. If you use this function while holding a resource the IRQ handler may need, you will deadlock.

This function may be called - with care - from IRQ context.

booldisable_hardirq(unsigned int irq)

disables an irq and waits for hardirq completion

Parameters

unsignedintirq

Interrupt to disable

Disable the selected interrupt line. Enables and Disables are nested. This function waits for any pending hard IRQ handlers for this interrupt to complete before returning. If you use this function while holding a resource the hard IRQ handler may need, you will deadlock.

When used to optimistically disable an interrupt from atomic context, the return value must be checked.

Return

false if a threaded handler is active.

This function may be called - with care - from IRQ context.
voidenable_irq(unsigned int irq)

enable handling of an irq

Parameters

unsignedintirq

Interrupt to enable

Undoes the effect of one call to disable_irq(). If this matches the last disable, processing of interrupts on this IRQ line is re-enabled.

This function may be called from IRQ context only when desc->irq_data.chip->bus_lock and desc->chip->bus_sync_unlock are NULL!
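
A tiny sketch of the nested disable/enable pairing around a reconfiguration that races with the handler; the function name is illustrative:

    #include <linux/interrupt.h>

    static void demo_reconfigure_safely(unsigned int irq)
    {
            disable_irq(irq);       /* waits for running handlers to finish */
            /* ... safely touch state the IRQ handler also uses ... */
            enable_irq(irq);        /* matches the disable above */
    }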

intirq_set_irq_wake(unsigned int irq, unsigned int on)

control irq power management wakeup

Parameters

unsignedintirq
interrupt to control
unsignedinton

enable/disable power management wakeup

Enable/disable power management wakeup mode, which is disabled by default. Enables and disables must match, just as they match for non-wakeup mode support.

Wakeup mode lets this IRQ wake the system from sleep states like “suspend to RAM”.

Note

irq enable/disable state is completely orthogonal
to the enable/disable state of irq wake. An irq can be disabled with disable_irq() and still wake the system as long as the irq has wake enabled. If this does not hold, then the underlying irq chip and the related driver need to be investigated.
voidirq_wake_thread(unsigned int irq, void * dev_id)

wake the irq thread for the action identified by dev_id

Parameters

unsignedintirq
Interrupt line
void*dev_id
Device identity for which the thread should be woken
const void *free_irq(unsigned int irq, void * dev_id)

free an interrupt allocated with request_irq

Parameters

unsignedintirq
Interrupt line to free
void*dev_id

Device identity to free

Remove an interrupt handler. The handler is removed and if the interrupt line is no longer in use by any driver it is disabled. On a shared IRQ the caller must ensure the interrupt is disabled on the card it drives before calling this function. The function does not return until any executing interrupts for this IRQ have completed.

This function must not be called from interrupt context.

Returns the devname argument passed to request_irq.

intrequest_threaded_irq(unsigned int irq, irq_handler_t handler, irq_handler_t thread_fn, unsigned long irqflags, const char * devname, void * dev_id)

allocate an interrupt line

Parameters

unsignedintirq
Interrupt line to allocate
irq_handler_thandler
Function to be called when the IRQ occurs. Primary handler for threaded interrupts. If NULL and thread_fn != NULL, the default primary handler is installed.
irq_handler_tthread_fn
Function called from the irq handler thread. If NULL, no irq thread is created.
unsignedlongirqflags
Interrupt type flags
constchar*devname
An ascii name for the claiming device
void*dev_id

A cookie passed back to the handler function

This call allocates interrupt resources and enables the interrupt line and IRQ handling. From the point this call is made your handler function may be invoked. Since your handler function must clear any interrupt the board raises, you must take care both to initialise your hardware and to set up the interrupt handler in the right order.

If you want to set up a threaded irq handler for your device then you need to supply handler and thread_fn. handler is still called in hard interrupt context and has to check whether the interrupt originates from the device. If yes, it needs to disable the interrupt on the device and return IRQ_WAKE_THREAD, which will wake up the handler thread and run thread_fn. This split handler design is necessary to support shared interrupts; see the sketch after the flags list below.

Dev_id must be globally unique. Normally the address of the device data structure is used as the cookie. Since the handler receives this value it makes sense to use it.

If your interrupt is shared you must pass a non-NULL dev_id as this is required when freeing the interrupt.

Flags:

IRQF_SHARED - Interrupt is shared
IRQF_TRIGGER_* - Specify active edge(s) or level
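
A hedged sketch of the split primary/threaded handler design described above; the device structure and its helpers are hypothetical:

    #include <linux/interrupt.h>

    struct demo_dev {                            /* hypothetical device */
            unsigned int irq;
            /* ... */
    };

    /* hypothetical device helpers, assumed to exist elsewhere */
    bool demo_irq_is_ours(struct demo_dev *dev);
    void demo_mask_device_irq(struct demo_dev *dev);
    void demo_unmask_device_irq(struct demo_dev *dev);
    void demo_process_events(struct demo_dev *dev);

    /* primary handler: hard interrupt context, must not sleep */
    static irqreturn_t demo_hardirq(int irq, void *dev_id)
    {
            struct demo_dev *dev = dev_id;

            if (!demo_irq_is_ours(dev))
                    return IRQ_NONE;             /* shared line, not ours */
            demo_mask_device_irq(dev);
            return IRQ_WAKE_THREAD;              /* run demo_thread_fn() */
    }

    /* threaded handler: runs in process context and may sleep */
    static irqreturn_t demo_thread_fn(int irq, void *dev_id)
    {
            struct demo_dev *dev = dev_id;

            demo_process_events(dev);
            demo_unmask_device_irq(dev);
            return IRQ_HANDLED;
    }

    static int demo_setup_irq(struct demo_dev *dev)
    {
            /* dev doubles as the unique dev_id cookie */
            return request_threaded_irq(dev->irq, demo_hardirq,
                                        demo_thread_fn, IRQF_SHARED,
                                        "demo", dev);
    }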

intrequest_any_context_irq(unsigned int irq, irq_handler_t handler, unsigned long flags, const char * name, void * dev_id)

allocate an interrupt line

Parameters

unsignedintirq
Interrupt line to allocate
irq_handler_thandler
Function to be called when the IRQ occurs. Threaded handler for threaded interrupts.
unsignedlongflags
Interrupt type flags
constchar*name
An ascii name for the claiming device
void*dev_id

A cookie passed back to the handler function

This call allocates interrupt resources and enables the interrupt line and IRQ handling. It selects either a hardirq or threaded handling method depending on the context.

On failure, it returns a negative value. On success, it returns either IRQC_IS_HARDIRQ or IRQC_IS_NESTED.

boolirq_percpu_is_enabled(unsigned int irq)

Check whether the per cpu irq is enabled

Parameters

unsignedintirq
Linux irq number to check for

Description

Must be called from a non-migratable context. Returns the enable state of a per cpu interrupt on the current cpu.

voidfree_percpu_irq(unsigned int irq, void __percpu * dev_id)

free an interrupt allocated with request_percpu_irq

Parameters

unsignedintirq
Interrupt line to free
void__percpu*dev_id

Device identity to free

Remove a percpu interrupt handler. The handler is removed, but the interrupt line is not disabled. This must be done on each CPU before calling this function. The function does not return until any executing interrupts for this IRQ have completed.

This function must not be called from interrupt context.

int__request_percpu_irq(unsigned int irq, irq_handler_t handler, unsigned long flags, const char * devname, void __percpu * dev_id)

allocate a percpu interrupt line

Parameters

unsignedintirq
Interrupt line to allocate
irq_handler_thandler
Function to be called when the IRQ occurs.
unsignedlongflags
Interrupt type flags (IRQF_TIMER only)
constchar*devname
An ascii name for the claiming device
void__percpu*dev_id

A percpu cookie passed back to the handler function

This call allocates interrupt resources and enables the interrupt on the local CPU. If the interrupt is supposed to be enabled on other CPUs, it has to be done on each CPU using enable_percpu_irq().

Dev_id must be globally unique. It is a per-cpu variable, and the handler gets called with the interrupted CPU’s instance of that variable.

intirq_get_irqchip_state(unsigned int irq, enum irqchip_irq_state which, bool * state)

returns the irqchip state of an interrupt.

Parameters

unsignedintirq
Interrupt line that is forwarded to a VM
enumirqchip_irq_statewhich
One of IRQCHIP_STATE_* the caller wants to know about
bool*state

a pointer to a boolean where the state is to be stored

This call snapshots the internal irqchip state of an interrupt, returning into state the bit corresponding to stage which.

This function should be called with preemption disabled if the interrupt controller has per-cpu registers.

intirq_set_irqchip_state(unsigned int irq, enum irqchip_irq_state which, bool val)

set the state of a forwarded interrupt.

Parameters

unsignedintirq
Interrupt line that is forwarded to a VM
enumirqchip_irq_statewhich
State to be restored (one of IRQCHIP_STATE_*)
boolval

Value corresponding towhich

This call sets the internal irqchip state of an interrupt, depending on the value of which.

This function should be called with preemption disabled if the interrupt controller has per-cpu registers.

DMA Channels

intrequest_dma(unsigned int dmanr, const char * device_id)

request and reserve a system DMA channel

Parameters

unsignedintdmanr
DMA channel number
constchar*device_id
reserving device ID string, used in /proc/dma
voidfree_dma(unsigned int dmanr)

free a reserved system DMA channel

Parameters

unsignedintdmanr
DMA channel number
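
A short sketch of the reserve/use/release pattern for these two calls; the channel number and ID string are illustrative:

    #include <asm/dma.h>

    static int demo_claim_dma(void)
    {
            int err = request_dma(3, "demo-device"); /* shows up in /proc/dma */

            if (err)
                    return err;
            /* ... program the DMA controller and run transfers ... */
            free_dma(3);                             /* release the channel */
            return 0;
    }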

Resources Management

struct resource *request_resource_conflict(struct resource * root, struct resource * new)

request and reserve an I/O or memory resource

Parameters

structresource*root
root resource descriptor
structresource*new
resource descriptor desired by caller

Description

Returns 0 for success, conflict resource on error.

intfind_next_iomem_res(resource_size_t start, resource_size_t end, unsigned long flags, unsigned long desc, bool first_lvl, struct resource * res)

Parameters

resource_size_tstart
start address of the resource searched for
resource_size_tend
end address of same resource
unsignedlongflags
flags which the resource must have
unsignedlongdesc
descriptor the resource must have
boolfirst_lvl
walk only the first level children, if set
structresource*res
return ptr, if resource found

Description

caller must specify start, end, flags, and desc (which may be IORES_DESC_NONE).

If a resource is found, returns 0 and *res is overwritten with the part of the resource that’s within [start..end]; if none is found, returns -ENODEV. Returns -EINVAL for invalid parameters.

This function walks the whole tree and not just first level children unless first_lvl is true.

intreallocate_resource(struct resource * root, struct resource * old, resource_size_t newsize, struct resource_constraint * constraint)

allocate a slot in the resource tree given range & alignment. The resource will be relocated if the new size cannot be reallocated in the current location.

Parameters

structresource*root
root resource descriptor
structresource*old
resource descriptor desired by caller
resource_size_tnewsize
new size of the resource descriptor
structresource_constraint*constraint
the size and alignment constraints to be met.
struct resource *lookup_resource(struct resource * root, resource_size_t start)

find an existing resource by a resource start address

Parameters

structresource*root
root resource descriptor
resource_size_tstart
resource start address

Description

Returns a pointer to the resource if found, NULL otherwise

struct resource *insert_resource_conflict(struct resource * parent, struct resource * new)

Inserts resource in the resource tree

Parameters

structresource*parent
parent of the new resource
structresource*new
new resource to insert

Description

Returns 0 on success, conflict resource if the resource can’t be inserted.

This function is equivalent to request_resource_conflict() when no conflict happens. If a conflict happens, and the conflicting resources entirely fit within the range of the new resource, then the new resource is inserted and the conflicting resources become children of the new resource.

This function is intended for producers of resources, such as FW modules and bus drivers.

voidinsert_resource_expand_to_fit(struct resource * root, struct resource * new)

Insert a resource into the resource tree

Parameters

structresource*root
root resource descriptor
structresource*new
new resource to insert

Description

Insert a resource into the resource tree, possibly expanding it in order to make it encompass any conflicting resources.

resource_size_tresource_alignment(struct resource * res)

calculate resource’s alignment

Parameters

structresource*res
resource pointer

Description

Returns alignment on success, 0 (invalid alignment) on failure.

intrelease_mem_region_adjustable(struct resource * parent, resource_size_t start, resource_size_t size)

release a previously reserved memory region

Parameters

structresource*parent
parent resource descriptor
resource_size_tstart
resource start address
resource_size_tsize
resource region size

Description

This interface is intended for memory hot-delete. The requested region is released from a currently busy memory resource. The requested region must either match exactly or fit into a single busy resource entry. In the latter case, the remaining resource is adjusted accordingly. Existing children of the busy memory resource must be immutable in the request.

Note

  • Additional release conditions, such as overlapping region, can be supported after they are confirmed as valid cases.
  • When a busy memory resource gets split into two entries, the code assumes that all children remain in the lower address entry for simplicity. Enhance this logic when necessary.
intrequest_resource(struct resource * root, struct resource * new)

request and reserve an I/O or memory resource

Parameters

structresource*root
root resource descriptor
structresource*new
resource descriptor desired by caller

Description

Returns 0 for success, negative error code on error.
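
A brief sketch of reserving a memory-mapped register window under the global iomem_resource tree; the address range and names are illustrative:

    #include <linux/ioport.h>

    static struct resource demo_res = {
            .name  = "demo-regs",
            .start = 0xfed00000,        /* illustrative physical range */
            .end   = 0xfed00fff,
            .flags = IORESOURCE_MEM,
    };

    static int demo_claim(void)
    {
            /* 0 on success, negative error if the range conflicts */
            return request_resource(&iomem_resource, &demo_res);
    }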

intrelease_resource(struct resource * old)

release a previously reserved resource

Parameters

structresource*old
resource pointer
intwalk_iomem_res_desc(unsigned long desc, unsigned long flags, u64 start, u64 end, void * arg, int (*func)(struct resource *, void *))

Parameters

unsignedlongdesc
I/O resource descriptor. Use IORES_DESC_NONE to skipdesc check.
unsignedlongflags
I/O resource flags
u64start
start addr
u64end
end addr
void*arg
function argument for the callbackfunc
int(*)(structresource*,void*)func
callback function that is called for each qualifying resource area

Description

All the memory ranges which overlap start,end and also match flags and desc are valid candidates. This walks through the whole tree and not just first level children.

NOTE

For a new descriptor search, define a new IORES_DESC in <linux/ioport.h> and set it in ‘desc’ of a target resource entry.

intregion_intersects(resource_size_t start, size_t size, unsigned long flags, unsigned long desc)

determine intersection of region with known resources

Parameters

resource_size_tstart
region start address
size_tsize
size of region
unsignedlongflags
flags of resource (in iomem_resource)
unsignedlongdesc
descriptor of resource (in iomem_resource) or IORES_DESC_NONE

Description

Check if the specified region partially overlaps or fully eclipses a resource identified by flags and desc (optional with IORES_DESC_NONE). Return REGION_DISJOINT if the region does not overlap flags/desc, return REGION_MIXED if the region overlaps flags/desc and another resource, and return REGION_INTERSECTS if the region overlaps flags/desc and no other defined resource. Note that REGION_INTERSECTS is also returned in the case when the specified region overlaps RAM and undefined memory holes.

region_intersects() is used by memory remapping functions to ensure the user is not remapping RAM and is a vast speed up over walking through the resource table page by page.

intallocate_resource(struct resource * root, struct resource * new, resource_size_t size, resource_size_t min, resource_size_t max, resource_size_t align, resource_size_t (*alignf)(void *, const struct resource *, resource_size_t, resource_size_t), void * alignf_data)

allocate empty slot in the resource tree given range & alignment. The resource will be reallocated with a new size if it was already allocated

Parameters

structresource*root
root resource descriptor
structresource*new
resource descriptor desired by caller
resource_size_tsize
requested resource region size
resource_size_tmin
minimum boundary to allocate
resource_size_tmax
maximum boundary to allocate
resource_size_talign
alignment requested, in bytes
resource_size_t(*)(void*,conststructresource*,resource_size_t,resource_size_t)alignf
alignment function, optional, called if not NULL
void*alignf_data
arbitrary data to pass to thealignf function
intinsert_resource(struct resource * parent, struct resource * new)

Inserts a resource in the resource tree

Parameters

structresource*parent
parent of the new resource
structresource*new
new resource to insert

Description

Returns 0 on success, -EBUSY if the resource can’t be inserted.

This function is intended for producers of resources, such as FW modulesand bus drivers.

intremove_resource(struct resource * old)

Remove a resource in the resource tree

Parameters

structresource*old
resource to remove

Description

Returns 0 on success, -EINVAL if the resource is not valid.

This function removes a resource previously inserted by insert_resource() or insert_resource_conflict(), and moves the children (if any) up to where they were before. insert_resource() and insert_resource_conflict() insert a new resource, and move any conflicting resources down to the children of the new resource.

insert_resource(), insert_resource_conflict() and remove_resource() are intended for producers of resources, such as FW modules and bus drivers.

intadjust_resource(struct resource * res, resource_size_t start, resource_size_t size)

modify a resource’s start and size

Parameters

structresource*res
resource to modify
resource_size_tstart
new start value
resource_size_tsize
new size

Description

Given an existing resource, change its start and size to match the arguments. Returns 0 on success, -EBUSY if it can’t fit. Existing children of the resource are assumed to be immutable.

struct resource *__request_region(struct resource * parent, resource_size_t start, resource_size_t n, const char * name, int flags)

create a new busy resource region

Parameters

structresource*parent
parent resource descriptor
resource_size_tstart
resource start address
resource_size_tn
resource region size
constchar*name
reserving caller’s ID string
intflags
IO resource flags
void__release_region(struct resource * parent, resource_size_t start, resource_size_t n)

release a previously reserved resource region

Parameters

structresource*parent
parent resource descriptor
resource_size_tstart
resource start address
resource_size_tn
resource region size

Description

The described resource region must match a currently busy region.

intdevm_request_resource(structdevice * dev, struct resource * root, struct resource * new)

request and reserve an I/O or memory resource

Parameters

structdevice*dev
device for which to request the resource
structresource*root
root of the resource tree from which to request the resource
structresource*new
descriptor of the resource to request

Description

This is a device-managed version of request_resource(). There is usually no need to release resources requested by this function explicitly since that will be taken care of when the device is unbound from its driver. If for some reason the resource needs to be released explicitly, because of ordering issues for example, drivers must call devm_release_resource() rather than the regular release_resource().

When a conflict is detected between any existing resources and the newly requested resource, an error message will be printed.

Returns 0 on success or a negative error code on failure.

voiddevm_release_resource(structdevice * dev, struct resource * new)

release a previously requested resource

Parameters

structdevice*dev
device for which to release the resource
structresource*new
descriptor of the resource to release

Description

Releases a resource previously requested using devm_request_resource().

struct resource *devm_request_free_mem_region(structdevice * dev, struct resource * base, unsigned long size)

find free region for device private memory

Parameters

structdevice*dev
device struct to bind the resource to
structresource*base
resource tree to look in
unsignedlongsize
size in bytes of the device memory to add

Description

This function tries to find an empty range of physical address big enough to contain the new resource, so that it can later be hotplugged as ZONE_DEVICE memory, which in turn allocates struct pages.

MTRR Handling

intarch_phys_wc_add(unsigned long base, unsigned long size)

add a WC MTRR and handle errors if PAT is unavailable

Parameters

unsignedlongbase
Physical base address
unsignedlongsize
Size of region

Description

If PAT is available, this does nothing. If PAT is unavailable, it attempts to add a WC MTRR covering size bytes starting at base and logs an error if this fails.

The caller should provide a power of two size on an equivalent power of two boundary.

Drivers must store the return value to pass to mtrr_del_wc_if_needed, but drivers should not try to interpret that return value.

Security Framework

intsecurity_init(void)

initializes the security framework

Parameters

void
no arguments

Description

This should be called early in the kernel initialization sequence.

voidsecurity_add_hooks(struct security_hook_list * hooks, int count, char * lsm)

Add a module’s hooks to the hook lists.

Parameters

structsecurity_hook_list*hooks
the hooks to add
intcount
the number of hooks to add
char*lsm
the name of the security module

Description

Each LSM has to register its hooks with the infrastructure.

intlsm_cred_alloc(struct cred * cred, gfp_t gfp)

allocate a composite cred blob

Parameters

structcred*cred
the cred that needs a blob
gfp_tgfp
allocation type

Description

Allocate the cred blob for all the modules

Returns 0, or -ENOMEM if memory can’t be allocated.

voidlsm_early_cred(struct cred * cred)

during initialization allocate a composite cred blob

Parameters

structcred*cred
the cred that needs a blob

Description

Allocate the cred blob for all the modules

intlsm_file_alloc(struct file * file)

allocate a composite file blob

Parameters

structfile*file
the file that needs a blob

Description

Allocate the file blob for all the modules

Returns 0, or -ENOMEM if memory can’t be allocated.

intlsm_inode_alloc(struct inode * inode)

allocate a composite inode blob

Parameters

structinode*inode
the inode that needs a blob

Description

Allocate the inode blob for all the modules

Returns 0, or -ENOMEM if memory can’t be allocated.

intlsm_task_alloc(struct task_struct * task)

allocate a composite task blob

Parameters

structtask_struct*task
the task that needs a blob

Description

Allocate the task blob for all the modules

Returns 0, or -ENOMEM if memory can’t be allocated.

intlsm_ipc_alloc(struct kern_ipc_perm * kip)

allocate a composite ipc blob

Parameters

structkern_ipc_perm*kip
the ipc that needs a blob

Description

Allocate the ipc blob for all the modules

Returns 0, or -ENOMEM if memory can’t be allocated.

intlsm_msg_msg_alloc(struct msg_msg * mp)

allocate a composite msg_msg blob

Parameters

structmsg_msg*mp
the msg_msg that needs a blob

Description

Allocate the ipc blob for all the modules

Returns 0, or -ENOMEM if memory can’t be allocated.

voidlsm_early_task(struct task_struct * task)

during initialization allocate a composite task blob

Parameters

structtask_struct*task
the task that needs a blob

Description

Allocate the task blob for all the modules

struct dentry *securityfs_create_file(const char * name, umode_t mode, struct dentry * parent, void * data, const struct file_operations * fops)

create a file in the securityfs filesystem

Parameters

constchar*name
a pointer to a string containing the name of the file to create.
umode_tmode
the permission that the file should have
structdentry*parent
a pointer to the parent dentry for this file. This should be a directory dentry if set. If this parameter is NULL, then the file will be created in the root of the securityfs filesystem.
void*data
a pointer to something that the caller will want to get to later on. The inode.i_private pointer will point to this value on the open() call.
conststructfile_operations*fops
a pointer to a struct file_operations that should be used for this file.

Description

This function creates a file in securityfs with the given name.

This function returns a pointer to a dentry if it succeeds. This pointer must be passed to the securityfs_remove() function when the file is to be removed (no automatic cleanup happens if your module is unloaded, you are responsible here). If an error occurs, the function will return the error value (via ERR_PTR).

If securityfs is not enabled in the kernel, the value -ENODEV is returned.
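
A hedged sketch of creating a securityfs file and removing it again on module exit; the file name and read method are hypothetical:

    #include <linux/security.h>
    #include <linux/fs.h>
    #include <linux/err.h>
    #include <linux/module.h>

    static ssize_t demo_read(struct file *file, char __user *buf,
                             size_t count, loff_t *ppos);  /* hypothetical */

    static const struct file_operations demo_fops = {
            .read = demo_read,
    };

    static struct dentry *demo_dentry;

    static int __init demo_init(void)
    {
            demo_dentry = securityfs_create_file("demo", 0444, NULL,
                                                 NULL, &demo_fops);
            if (IS_ERR(demo_dentry))            /* e.g. -ENODEV */
                    return PTR_ERR(demo_dentry);
            return 0;
    }

    static void __exit demo_exit(void)
    {
            securityfs_remove(demo_dentry);     /* no automatic cleanup */
    }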

struct dentry *securityfs_create_dir(const char * name, struct dentry * parent)

create a directory in the securityfs filesystem

Parameters

constchar*name
a pointer to a string containing the name of the directory to create.
structdentry*parent
a pointer to the parent dentry for this file. This should be a directory dentry if set. If this parameter is NULL, then the directory will be created in the root of the securityfs filesystem.

Description

This function creates a directory in securityfs with the given name.

This function returns a pointer to a dentry if it succeeds. This pointer must be passed to the securityfs_remove() function when the file is to be removed (no automatic cleanup happens if your module is unloaded, you are responsible here). If an error occurs, the function will return the error value (via ERR_PTR).

If securityfs is not enabled in the kernel, the value -ENODEV is returned.

struct dentry *securityfs_create_symlink(const char * name, struct dentry * parent, const char * target, const struct inode_operations * iops)

create a symlink in the securityfs filesystem

Parameters

constchar*name
a pointer to a string containing the name of the symlink to create.
structdentry*parent
a pointer to the parent dentry for the symlink. This should be a directory dentry if set. If this parameter is NULL, then the directory will be created in the root of the securityfs filesystem.
constchar*target
a pointer to a string containing the name of the symlink’s target. If this parameter is NULL, then the iops parameter needs to be set up to handle .readlink and .get_link inode_operations.
conststructinode_operations*iops
a pointer to the struct inode_operations to use for the symlink. If this parameter is NULL, then the default simple_symlink_inode operations will be used.

Description

This function creates a symlink in securityfs with the given name.

This function returns a pointer to a dentry if it succeeds. This pointer must be passed to the securityfs_remove() function when the file is to be removed (no automatic cleanup happens if your module is unloaded, you are responsible here). If an error occurs, the function will return the error value (via ERR_PTR).

If securityfs is not enabled in the kernel, the value -ENODEV is returned.

voidsecurityfs_remove(struct dentry * dentry)

removes a file or directory from the securityfs filesystem

Parameters

structdentry*dentry
a pointer to the dentry of the file or directory to be removed.

Description

This function removes a file or directory in securityfs that was previously created with a call to another securityfs function (like securityfs_create_file() or variants thereof).

This function is required to be called in order for the file to be removed. No automatic cleanup of files will happen when a module is removed; you are responsible here.

Audit Interfaces

struct audit_buffer *audit_log_start(struct audit_context * ctx, gfp_t gfp_mask, int type)

obtain an audit buffer

Parameters

structaudit_context*ctx
audit_context (may be NULL)
gfp_tgfp_mask
type of allocation
inttype
audit message type

Description

Returns audit_buffer pointer on success or NULL on error.

Obtain an audit buffer. This routine does locking to obtain the audit buffer, but then no locking is required for calls to audit_log_*format. If the task (ctx) is a task that is currently in a syscall, then the syscall is marked as auditable and an audit record will be written at syscall exit. If there is no associated task, then task context (ctx) should be NULL.

voidaudit_log_format(struct audit_buffer * ab, const char * fmt, ...)

format a message into the audit buffer.

Parameters

structaudit_buffer*ab
audit_buffer
constchar*fmt
format string
...
optional parameters matchingfmt string

Description

All the work is done in audit_log_vformat.

voidaudit_log_end(struct audit_buffer * ab)

end one audit record

Parameters

structaudit_buffer*ab
the audit_buffer

Description

We cannot do a netlink send inside an irq context because it blocks (the last arg, flags, is not set to MSG_DONTWAIT), so the audit buffer is placed on a queue and a tasklet is scheduled to remove them from the queue outside the irq context. May be called in any context.

voidaudit_log(struct audit_context * ctx, gfp_t gfp_mask, int type, const char * fmt, ...)

Log an audit record

Parameters

structaudit_context*ctx
audit context
gfp_tgfp_mask
type of allocation
inttype
audit message type
constchar*fmt
format string to use
...
variable parameters matching the format string

Description

This is a convenience function that calls audit_log_start, audit_log_vformat, and audit_log_end. It may be called in any context.
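
A rough sketch of both patterns (the record type AUDIT_KERNEL and the message text are illustrative only):

#include <linux/audit.h>
#include <linux/gfp.h>

static void example_audit(void)
{
	struct audit_buffer *ab;

	/* three-call sequence */
	ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_KERNEL);
	if (ab) {
		audit_log_format(ab, "example op=%s res=%d", "demo", 1);
		audit_log_end(ab);	/* queues the record for delivery */
	}

	/* equivalent convenience call */
	audit_log(audit_context(), GFP_KERNEL, AUDIT_KERNEL,
		  "example op=%s res=%d", "demo", 1);
}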

intaudit_alloc(struct task_struct * tsk)

allocate an audit context block for a task

Parameters

structtask_struct*tsk
task

Description

Filter on the task information and allocate a per-task audit context if necessary. Doing so turns on system call auditing for the specified task. This is called from copy_process, so no lock is needed.

void__audit_free(struct task_struct * tsk)

free a per-task audit context

Parameters

structtask_struct*tsk
task whose audit context block to free

Description

Called from copy_process and do_exit

void__audit_syscall_entry(int major, unsigned long a1, unsigned long a2, unsigned long a3, unsigned long a4)

fill in an audit record at syscall entry

Parameters

intmajor
major syscall type (function)
unsignedlonga1
additional syscall register 1
unsignedlonga2
additional syscall register 2
unsignedlonga3
additional syscall register 3
unsignedlonga4
additional syscall register 4

Description

Fill in audit context at syscall entry. This only happens if the audit context was created when the task was created and the state or filters demand the audit context be built. If the state from the per-task filter or from the per-syscall filter is AUDIT_RECORD_CONTEXT, then the record will be written at syscall exit time (otherwise, it will only be written if another part of the kernel requests that it be written).

void__audit_syscall_exit(int success, long return_code)

deallocate audit context after a system call

Parameters

intsuccess
success value of the syscall
longreturn_code
return value of the syscall

Description

Tear down after system call. If the audit context has been marked as auditable (either because of the AUDIT_RECORD_CONTEXT state from filtering, or because some other part of the kernel wrote an audit message), then write out the syscall information. In all cases, free the names stored from getname().

struct filename *__audit_reusename(const __user char * uptr)

fill out filename with info from existing entry

Parameters

const__userchar*uptr
userland ptr to pathname

Description

Search the audit_names list for the current audit context. If there is an existing entry with a matching “uptr” then return the filename associated with that audit_name. If not, return NULL.

void__audit_getname(struct filename * name)

add a name to the list

Parameters

structfilename*name
name to add

Description

Add a name to the list of audit names for this context. Called from fs/namei.c:getname().

void__audit_inode(struct filename * name, const struct dentry * dentry, unsigned int flags)

store the inode and device from a lookup

Parameters

structfilename*name
name being audited
conststructdentry*dentry
dentry being audited
unsignedintflags
attributes for this particular entry
intauditsc_get_stamp(struct audit_context * ctx, struct timespec64 * t, unsigned int * serial)

get local copies of audit_context values

Parameters

structaudit_context*ctx
audit_context for the task
structtimespec64*t
timespec64 to store time recorded in the audit_context
unsignedint*serial
serial value that is recorded in the audit_context

Description

Also sets the context as auditable.

void__audit_mq_open(int oflag, umode_t mode, struct mq_attr * attr)

record audit data for a POSIX MQ open

Parameters

intoflag
open flag
umode_tmode
mode bits
structmq_attr*attr
queue attributes
void__audit_mq_sendrecv(mqd_t mqdes, size_t msg_len, unsigned int msg_prio, const struct timespec64 * abs_timeout)

record audit data for a POSIX MQ timed send/receive

Parameters

mqd_tmqdes
MQ descriptor
size_tmsg_len
Message length
unsignedintmsg_prio
Message priority
conststructtimespec64*abs_timeout
Message timeout in absolute time
void__audit_mq_notify(mqd_t mqdes, const struct sigevent * notification)

record audit data for a POSIX MQ notify

Parameters

mqd_tmqdes
MQ descriptor
conststructsigevent*notification
Notification event
void__audit_mq_getsetattr(mqd_t mqdes, struct mq_attr * mqstat)

record audit data for a POSIX MQ get/set attribute

Parameters

mqd_tmqdes
MQ descriptor
structmq_attr*mqstat
MQ flags
void__audit_ipc_obj(struct kern_ipc_perm * ipcp)

record audit data for ipc object

Parameters

structkern_ipc_perm*ipcp
ipc permissions
void__audit_ipc_set_perm(unsigned long qbytes, uid_t uid, gid_t gid, umode_t mode)

record audit data for new ipc permissions

Parameters

unsignedlongqbytes
msgq bytes
uid_tuid
msgq user id
gid_tgid
msgq group id
umode_tmode
msgq mode (permissions)

Description

Called only after audit_ipc_obj().

int__audit_socketcall(int nargs, unsigned long * args)

record audit data for sys_socketcall

Parameters

intnargs
number of args, which should not be more than AUDITSC_ARGS.
unsignedlong*args
args array
void__audit_fd_pair(int fd1, int fd2)

record audit data for pipe and socketpair

Parameters

intfd1
the first file descriptor
intfd2
the second file descriptor
int__audit_sockaddr(int len, void * a)

record audit data for sys_bind, sys_connect, sys_sendto

Parameters

intlen
data length in user space
void*a
data address in kernel space

Description

Returns 0 on success or when there is no audit context, or < 0 on error.

intaudit_signal_info_syscall(struct task_struct * t)

record signal info for syscalls

Parameters

structtask_struct*t
task being signaled

Description

If the audit subsystem is being terminated, record the task (pid) and uid that is doing that.

int__audit_log_bprm_fcaps(struct linux_binprm * bprm, const struct cred * new, const struct cred * old)

store information about a loading bprm and relevant fcaps

Parameters

structlinux_binprm*bprm
pointer to the bprm being processed
conststructcred*new
the proposed new credentials
conststructcred*old
the old credentials

Description

Simply check if the proc already has the caps given by the file and if not store the priv escalation info for later auditing at the end of the syscall.


void__audit_log_capset(const struct cred * new, const struct cred * old)

store information about the arguments to the capset syscall

Parameters

conststructcred*new
the new credentials
conststructcred*old
the old (current) credentials

Description

Record the arguments userspace sent to sys_capset for later printing by the audit system if applicable.

voidaudit_core_dumps(long signr)

record information about processes that end abnormally

Parameters

longsignr
signal value

Description

If a process ends with a core dump, something fishy is going on and we should record the event for investigation.

voidaudit_seccomp(unsigned long syscall, long signr, int code)

record information about a seccomp action

Parameters

unsignedlongsyscall
syscall number
longsignr
signal value
intcode
the seccomp action

Description

Record the information associated with a seccomp action. Event filtering for seccomp actions that are not to be logged is done in seccomp_log(). Therefore, this function forces auditing independent of the audit_enabled and dummy context state because seccomp actions should be logged even when audit is not in use.

intaudit_rule_change(int type, int seq, void * data, size_t datasz)

apply all rules to the specified message type

Parameters

inttype
audit message type
intseq
netlink audit message sequence (serial) number
void*data
payload data
size_tdatasz
size of payload data
intaudit_list_rules_send(struct sk_buff * request_skb, int seq)

list the audit rules

Parameters

structsk_buff*request_skb
skb of request we are replying to (used to target the reply)
intseq
netlink audit message sequence (serial) number
intparent_len(const char * path)

find the length of the parent portion of a pathname

Parameters

constchar*path
pathname of which to determine length
intaudit_compare_dname_path(const struct qstr * dname, const char * path, int parentlen)

compare given dentry name with last component in given path. A return of 0 indicates a match.

Parameters

conststructqstr*dname
dentry name that we’re comparing
constchar*path
full pathname that we’re comparing
intparentlen
length of the parent if known. Passing in AUDIT_NAME_FULL here indicates that we must compute this value.

Accounting Framework

longsys_acct(const char __user * name)

enable/disable process accounting

Parameters

constchar__user*name
file name for accounting records or NULL to shut down accounting

Description

Returns 0 for success or negative errno values for failure.

sys_acct() is the only system call needed to implement process accounting. It takes the name of the file where accounting records should be written. If the filename is NULL, accounting will be shut down.

voidacct_collect(long exitcode, int group_dead)

collect accounting information into pacct_struct

Parameters

longexitcode
task exit code
intgroup_dead
non-zero if this thread is the last one in the process.
voidacct_process(void)

Parameters

void
no arguments

Description

handles process accounting for an exiting task

Block Devices

voidblk_queue_flag_set(unsigned int flag, struct request_queue * q)

atomically set a queue flag

Parameters

unsignedintflag
flag to be set
structrequest_queue*q
request queue
voidblk_queue_flag_clear(unsigned int flag, struct request_queue * q)

atomically clear a queue flag

Parameters

unsignedintflag
flag to be cleared
structrequest_queue*q
request queue
boolblk_queue_flag_test_and_set(unsigned int flag, struct request_queue * q)

atomically test and set a queue flag

Parameters

unsignedintflag
flag to be set
structrequest_queue*q
request queue

Description

Returns the previous value of flag - 0 if the flag was not set and 1 if the flag was already set.
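
A small usage sketch (QUEUE_FLAG_NOMERGES is just an example flag):

#include <linux/blkdev.h>

static void enable_nomerges_once(struct request_queue *q)
{
	/* returns the previous value, so one-time work can be skipped */
	if (!blk_queue_flag_test_and_set(QUEUE_FLAG_NOMERGES, q))
		pr_info("nomerges was newly enabled\n");
}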

const char *blk_op_str(unsigned int op)

Return the string XXX for REQ_OP_XXX.

Parameters

unsignedintop
REQ_OP_XXX.

Description

Centralized block layer function to convert REQ_OP_XXX into string format. Useful for debugging and tracing a bio or request. For an invalid REQ_OP_XXX it returns the string “UNKNOWN”.
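
A hypothetical debugging helper using blk_op_str():

#include <linux/blkdev.h>

static void dump_rq_op(struct request *rq)
{
	pr_debug("request op=%s sector=%llu\n",
		 blk_op_str(req_op(rq)),
		 (unsigned long long)blk_rq_pos(rq));
}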

voidblk_sync_queue(struct request_queue * q)

cancel any pending callbacks on a queue

Parameters

structrequest_queue*q
the queue

Description

The block layer may perform asynchronous callback activity on a queue, such as calling the unplug function after a timeout. A block device may call blk_sync_queue to ensure that any such activity is cancelled, thus allowing it to release resources that the callbacks might use. The caller must already have made sure that its ->submit_bio will not re-add plugging prior to calling this function.

This function does not cancel any asynchronous activity arising out of elevator or throttling code. That would require elevator_exit() and blkcg_exit_queue() to be called with queue lock initialized.

voidblk_set_pm_only(struct request_queue * q)

increment pm_only counter

Parameters

structrequest_queue*q
request queue pointer
voidblk_put_queue(struct request_queue * q)

decrement the request_queue refcount

Parameters

structrequest_queue*q
the request_queue structure to decrement the refcount for

Description

Decrements the refcount of the request_queue kobject. When this reaches 0 we’ll have blk_release_queue() called.

Context

Any context, but the last reference must not be dropped from atomic context.

voidblk_cleanup_queue(struct request_queue * q)

shutdown a request queue

Parameters

structrequest_queue*q
request queue to shutdown

Description

Mark q DYING, drain all pending requests, mark q DEAD, destroy and put it. All future requests will be failed immediately with -ENODEV.

Context

can sleep

boolblk_get_queue(struct request_queue * q)

increment the request_queue refcount

Parameters

structrequest_queue*q
the request_queue structure to increment the refcount for

Description

Increment the refcount of the request_queue kobject.

Context

Any context.

struct request *blk_get_request(struct request_queue * q, unsigned int op, blk_mq_req_flags_t flags)

allocate a request

Parameters

structrequest_queue*q
request queue to allocate a request for
unsignedintop
operation (REQ_OP_*) and REQ_* flags, e.g. REQ_SYNC.
blk_mq_req_flags_tflags
BLK_MQ_REQ_* flags, e.g. BLK_MQ_REQ_NOWAIT.
blk_qc_tsubmit_bio_noacct(struct bio * bio)

re-submit a bio to the block device layer for I/O

Parameters

structbio*bio
The bio describing the location in memory and on the device.

Description

This is a version of submit_bio() that shall only be used for I/O that is resubmitted to lower level drivers by stacking block drivers. All file systems and other upper level users of the block layer should use submit_bio() instead.

blk_qc_tsubmit_bio(struct bio * bio)

submit a bio to the block device layer for I/O

Parameters

structbio*bio
The struct bio which describes the I/O

Description

submit_bio() is used to submit I/O requests to block devices. It is passed a fully set up struct bio that describes the I/O that needs to be done. The bio will be sent to the device described by the bi_disk and bi_partno fields.

The success/failure status of the request, along with notification of completion, is delivered asynchronously through the ->bi_end_io() callback in bio. The bio must NOT be touched by the caller until ->bi_end_io() has been called.
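
A minimal sketch of building and submitting one READ bio (error handling elided; demo_end_io and demo_read_page are hypothetical names):

#include <linux/bio.h>
#include <linux/blkdev.h>

static void demo_end_io(struct bio *bio)
{
	pr_info("bio done, status=%d\n", bio->bi_status);
	bio_put(bio);
}

static void demo_read_page(struct block_device *bdev, struct page *page)
{
	struct bio *bio = bio_alloc(GFP_KERNEL, 1);

	bio_set_dev(bio, bdev);		/* fills in bi_disk/bi_partno */
	bio->bi_iter.bi_sector = 0;	/* device-relative start sector */
	bio->bi_opf = REQ_OP_READ;
	bio_add_page(bio, page, PAGE_SIZE, 0);
	bio->bi_end_io = demo_end_io;	/* called asynchronously */

	submit_bio(bio);
	/* the bio must not be touched here until demo_end_io() runs */
}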

blk_status_tblk_insert_cloned_request(struct request_queue * q, struct request * rq)

Helper for stacking drivers to submit a request

Parameters

structrequest_queue*q
the queue to submit the request
structrequest*rq
the request being queued
unsigned intblk_rq_err_bytes(const struct request * rq)

determine number of bytes till the next failure boundary

Parameters

conststructrequest*rq
request to examine

Description

A request could be a merge of IOs which require different failure handling. This function determines the number of bytes which can be failed from the beginning of the request without crossing into an area which needs to be retried further.

Return

The number of bytes to fail.
boolblk_update_request(struct request * req, blk_status_t error, unsigned int nr_bytes)

Special helper function for request stacking drivers

Parameters

structrequest*req
the request being processed
blk_status_terror
block status code
unsignedintnr_bytes
number of bytes to complete req

Description

Ends I/O on a number of bytes attached to req, but doesn’t complete the request structure even if req doesn’t have leftover. If req has leftover, sets it up for the next range of segments.

This special helper function is only for request stacking drivers (e.g. request-based dm) so that they can handle partial completion. Actual device drivers should use blk_mq_end_request instead.

Passing the result of blk_rq_bytes() as nr_bytes guarantees a false return from this function.

Note

The RQF_SPECIAL_PAYLOAD flag is ignored on purpose in both blk_rq_bytes() and in blk_update_request().

Return

false - this request doesn’t have any more data
true - this request has more data
voidrq_flush_dcache_pages(struct request * rq)

Helper function to flush all pages in a request

Parameters

structrequest*rq
the request to be flushed

Description

Flush all pages in rq.
intblk_lld_busy(struct request_queue * q)

Check if underlying low-level drivers of a device are busy

Parameters

structrequest_queue*q
the queue of the device being checked

Description

Check if underlying low-level drivers of a device are busy. If the drivers want to export their busy state, they must set their own exporting function using blk_queue_lld_busy() first.

Basically, this function is used only by request stacking drivers to stop dispatching requests to underlying devices when underlying devices are busy. This behavior helps more I/O merging on the queue of the request stacking driver and prevents I/O throughput regression on burst I/O load.

Return

0 - Not busy (The request stacking driver should dispatch request)
1 - Busy (The request stacking driver should stop dispatching request)
voidblk_rq_unprep_clone(struct request * rq)

Helper function to free all bios in a cloned request

Parameters

structrequest*rq
the clone request to be cleaned up

Description

Free all bios in rq for a cloned request.
intblk_rq_prep_clone(struct request * rq, struct request * rq_src, struct bio_set * bs, gfp_t gfp_mask, int (*bio_ctr)(struct bio *, struct bio *, void *), void * data)

Helper function to setup clone request

Parameters

structrequest*rq
the request to be setup
structrequest*rq_src
original request to be cloned
structbio_set*bs
bio_set that bios for clone are allocated from
gfp_tgfp_mask
memory allocation mask for bio
int(*)(structbio*,structbio*,void*)bio_ctr
setup function to be called for each clone bio. Returns 0 for success, non-0 for failure.
void*data
private data to be passed to bio_ctr

Description

Clones bios in rq_src to rq, and copies attributes of rq_src to rq. Also, pages which the original bios are pointing to are not copied and the cloned bios just point to the same pages. So cloned bios must be completed before the original bios, which means the caller must complete rq before rq_src.
voidblk_start_plug(struct blk_plug * plug)

initialize blk_plug and track it inside the task_struct

Parameters

structblk_plug*plug
The struct blk_plug that needs to be initialized

Description

blk_start_plug() indicates to the block layer an intent by the caller to submit multiple I/O requests in a batch. The block layer may use this hint to defer submitting I/Os from the caller until blk_finish_plug() is called. However, the block layer may choose to submit requests before a call to blk_finish_plug() if the number of queued I/Os exceeds BLK_MAX_REQUEST_COUNT, or if the size of the I/O is larger than BLK_PLUG_FLUSH_SIZE. The queued I/Os may also be submitted early if the task schedules (see below).

Tracking blk_plug inside the task_struct will help with auto-flushing the pending I/O should the task end up blocking between blk_start_plug() and blk_finish_plug(). This is important from a performance perspective, but also ensures that we don’t deadlock. For instance, if the task is blocking for a memory allocation, memory reclaim could end up wanting to free a page belonging to that request that is currently residing in our private plug. By flushing the pending I/O when the process goes to sleep, we avoid this kind of deadlock.

voidblk_finish_plug(struct blk_plug * plug)

mark the end of a batch of submitted I/O

Parameters

structblk_plug*plug
The struct blk_plug passed to blk_start_plug()

Description

Indicate that a batch of I/O submissions is complete. This function must be paired with an initial call to blk_start_plug(). The intent is to allow the block layer to optimize I/O submission. See the documentation for blk_start_plug() for more information.
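
A sketch of batching submissions under one plug (submit_batch is a hypothetical helper):

#include <linux/blkdev.h>

static void submit_batch(struct bio **bios, int nr)
{
	struct blk_plug plug;
	int i;

	blk_start_plug(&plug);
	for (i = 0; i < nr; i++)
		submit_bio(bios[i]);	/* may be held back on the plug */
	blk_finish_plug(&plug);		/* flushes anything still plugged */
}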

intblk_queue_enter(struct request_queue * q, blk_mq_req_flags_t flags)

try to increase q->q_usage_counter

Parameters

structrequest_queue*q
request queue pointer
blk_mq_req_flags_tflags
BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
boolblk_attempt_plug_merge(struct request_queue * q, struct bio * bio, unsigned int nr_segs, struct request ** same_queue_rq)

try to merge with current’s plugged list

Parameters

structrequest_queue*q
request_queue new bio is being queued at
structbio*bio
new bio being queued
unsignedintnr_segs
number of segments inbio
structrequest**same_queue_rq
pointer to struct request that gets filled in when another request associated with q is found on the plug list (optional, may be NULL)

Description

Determine whether bio being queued on q can be merged with a request on current’s plugged list. Returns true if merge was successful, otherwise false.

Plugging coalesces IOs from the same issuer for the same purpose without going through q->queue_lock. As such it’s more of an issuing mechanism than scheduling, and the request, while it may have elvpriv data, is not added on the elevator at this point. In addition, we don’t have reliable access to the elevator outside the queue lock. Only check basic merging parameters without querying the elevator.

Caller must ensure !blk_queue_nomerges(q) beforehand.

intblk_cloned_rq_check_limits(struct request_queue * q, struct request * rq)

Helper function to check a cloned request for the new queue limits

Parameters

structrequest_queue*q
the queue
structrequest*rq
the request being checked

Description

rq may have been made based on weaker limitations of upper-level queues in request stacking drivers, and it may violate the limitation of q. Since the block layer and the underlying device driver trust rq after it is inserted to q, it should be checked against q before the insertion using this generic function.

Request stacking drivers like request-based dm may change the queue limits when retrying requests on other queues. Those requests need to be checked against the new queue limits again during dispatch.

intblk_rq_map_user_iov(struct request_queue * q, struct request * rq, struct rq_map_data * map_data, const struct iov_iter * iter, gfp_t gfp_mask)

map user data to a request, for passthrough requests

Parameters

structrequest_queue*q
request queue where request should be inserted
structrequest*rq
request to map data to
structrq_map_data*map_data
pointer to the rq_map_data holding pages (if necessary)
conststructiov_iter*iter
iovec iterator
gfp_tgfp_mask
memory allocation flags

Description

Data will be mapped directly for zero copy I/O, if possible. Otherwise a kernel bounce buffer is used.

A matching blk_rq_unmap_user() must be issued at the end of I/O, while still in process context.

Note

The mapped bio may need to be bounced through blk_queue_bounce() before being submitted to the device, as pages mapped may be out of reach. It’s the caller’s responsibility to make sure this happens. The original bio must be passed back in to blk_rq_unmap_user() for proper unmapping.
intblk_rq_unmap_user(struct bio * bio)

unmap a request with user data

Parameters

structbio*bio
start of bio list

Description

Unmap a rq previously mapped by blk_rq_map_user(). The caller must supply the original rq->bio from the blk_rq_map_user() return, since the I/O completion may have changed rq->bio.
intblk_rq_map_kern(struct request_queue * q, struct request * rq, void * kbuf, unsigned int len, gfp_t gfp_mask)

map kernel data to a request, for passthrough requests

Parameters

structrequest_queue*q
request queue where request should be inserted
structrequest*rq
request to fill
void*kbuf
the kernel buffer
unsignedintlen
length of user data
gfp_tgfp_mask
memory allocation flags

Description

Data will be mapped directly if possible. Otherwise a bounce buffer is used. Can be called multiple times to append multiple buffers.
voidblk_release_queue(struct kobject * kobj)

releases all allocated resources of the request_queue

Parameters

structkobject*kobj
pointer to a kobject, whose container is a request_queue

Description

This function releases all allocated resources of the request queue.

The struct request_queue refcount is incremented with blk_get_queue() and decremented with blk_put_queue(). Once the refcount reaches 0 this function is called.

For drivers that have a request_queue on a gendisk and added with __device_add_disk() the refcount to request_queue will reach 0 with the last put_disk() called by the driver. For drivers which don’t use __device_add_disk() this happens with blk_cleanup_queue().

Drivers exist which depend on the release of the request_queue to be synchronous; it should not be deferred.

Context

can sleep

voidblk_unregister_queue(struct gendisk * disk)

counterpart of blk_register_queue()

Parameters

structgendisk*disk
Disk of which the request queue should be unregistered from sysfs.

Note

the caller is responsible for guaranteeing that this function is called after blk_register_queue() has finished.

voidblk_set_default_limits(struct queue_limits * lim)

reset limits to default values

Parameters

structqueue_limits*lim
the queue_limits structure to reset

Description

Returns a queue_limit struct to its default state.
voidblk_set_stacking_limits(struct queue_limits * lim)

set default limits for stacking devices

Parameters

structqueue_limits*lim
the queue_limits structure to reset

Description

Returns a queue_limit struct to its default state. Should be used by stacking drivers like DM that have no internal limits.
voidblk_queue_bounce_limit(struct request_queue * q, u64 max_addr)

set bounce buffer limit for queue

Parameters

structrequest_queue*q
the request queue for the device
u64max_addr
the maximum address the device can handle

Description

Different hardware can have different requirements as to what pages it can do I/O directly to. A low level driver can call blk_queue_bounce_limit to have lower memory pages allocated as bounce buffers for doing I/O to pages residing above max_addr.
voidblk_queue_max_hw_sectors(struct request_queue * q, unsigned int max_hw_sectors)

set max sectors for a request for this queue

Parameters

structrequest_queue*q
the request queue for the device
unsignedintmax_hw_sectors
max hardware sectors in the usual 512b unit

Description

Enables a low level driver to set a hard upper limit, max_hw_sectors, on the size of requests. max_hw_sectors is set by the device driver based upon the capabilities of the I/O controller.

max_dev_sectors is a hard limit imposed by the storage device for READ/WRITE requests. It is set by the disk driver.

max_sectors is a soft limit imposed by the block layer for filesystem type requests. This value can be overridden on a per-device basis in /sys/block/<device>/queue/max_sectors_kb. The soft limit cannot exceed max_hw_sectors.
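
A sketch of a low level driver publishing its limits at probe time; the values are arbitrary examples for a controller with 4K sectors, at most 128 segments and 1024 sectors per request (the helpers used here are described in the rest of this section):

#include <linux/blkdev.h>

static void mydrv_set_limits(struct request_queue *q)
{
	blk_queue_logical_block_size(q, 4096);
	blk_queue_physical_block_size(q, 4096);
	blk_queue_max_hw_sectors(q, 1024);
	blk_queue_max_segments(q, 128);
	blk_queue_max_segment_size(q, 65536);
}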

voidblk_queue_chunk_sectors(struct request_queue * q, unsigned int chunk_sectors)

set size of the chunk for this queue

Parameters

structrequest_queue*q
the request queue for the device
unsignedintchunk_sectors
chunk sectors in the usual 512b unit

Description

If a driver doesn’t want IOs to cross a given chunk size, it can set this limit and prevent merging across chunks. Note that the chunk size must currently be a power-of-2 in sectors. Also note that the block layer must accept a page worth of data at any offset. So if the crossing of chunks is a hard limitation in the driver, it must still be prepared to split single page bios.
voidblk_queue_max_discard_sectors(struct request_queue * q, unsigned int max_discard_sectors)

set max sectors for a single discard

Parameters

structrequest_queue*q
the request queue for the device
unsignedintmax_discard_sectors
maximum number of sectors to discard
voidblk_queue_max_write_same_sectors(struct request_queue * q, unsigned int max_write_same_sectors)

set max sectors for a single write same

Parameters

structrequest_queue*q
the request queue for the device
unsignedintmax_write_same_sectors
maximum number of sectors to write per command
voidblk_queue_max_write_zeroes_sectors(struct request_queue * q, unsigned int max_write_zeroes_sectors)

set max sectors for a single write zeroes

Parameters

structrequest_queue*q
the request queue for the device
unsignedintmax_write_zeroes_sectors
maximum number of sectors to write per command
voidblk_queue_max_zone_append_sectors(struct request_queue * q, unsigned int max_zone_append_sectors)

set max sectors for a single zone append

Parameters

structrequest_queue*q
the request queue for the device
unsignedintmax_zone_append_sectors
maximum number of sectors to write per command
voidblk_queue_max_segments(struct request_queue * q, unsigned short max_segments)

set max hw segments for a request for this queue

Parameters

structrequest_queue*q
the request queue for the device
unsignedshortmax_segments
max number of segments

Description

Enables a low level driver to set an upper limit on the number of hw data segments in a request.
voidblk_queue_max_discard_segments(struct request_queue * q, unsigned short max_segments)

set max segments for discard requests

Parameters

structrequest_queue*q
the request queue for the device
unsignedshortmax_segments
max number of segments

Description

Enables a low level driver to set an upper limit on the number of segments in a discard request.
voidblk_queue_max_segment_size(struct request_queue * q, unsigned int max_size)

set max segment size for blk_rq_map_sg

Parameters

structrequest_queue*q
the request queue for the device
unsignedintmax_size
max size of segment in bytes

Description

Enables a low level driver to set an upper limit on the size of a coalesced segment.
voidblk_queue_logical_block_size(struct request_queue * q, unsigned int size)

set logical block size for the queue

Parameters

structrequest_queue*q
the request queue for the device
unsignedintsize
the logical block size, in bytes

Description

This should be set to the lowest possible block size that the storage device can address. The default of 512 covers most hardware.
voidblk_queue_physical_block_size(struct request_queue * q, unsigned int size)

set physical block size for the queue

Parameters

structrequest_queue*q
the request queue for the device
unsignedintsize
the physical block size, in bytes

Description

This should be set to the lowest possible sector size that the hardware can operate on without reverting to read-modify-write operations.
voidblk_queue_alignment_offset(struct request_queue * q, unsigned int offset)

set physical block alignment offset

Parameters

structrequest_queue*q
the request queue for the device
unsignedintoffset
alignment offset in bytes

Description

Some devices are naturally misaligned to compensate for things like the legacy DOS partition table 63-sector offset. Low-level drivers should call this function for devices whose first sector is not naturally aligned.
voidblk_limits_io_min(struct queue_limits * limits, unsigned int min)

set minimum request size for a device

Parameters

structqueue_limits*limits
the queue limits
unsignedintmin
smallest I/O size in bytes

Description

Some devices have an internal block size bigger than the reported hardware sector size. This function can be used to signal the smallest I/O the device can perform without incurring a performance penalty.
voidblk_queue_io_min(struct request_queue * q, unsigned int min)

set minimum request size for the queue

Parameters

structrequest_queue*q
the request queue for the device
unsignedintmin
smallest I/O size in bytes

Description

Storage devices may report a granularity or preferred minimum I/O size which is the smallest request the device can perform without incurring a performance penalty. For disk drives this is often the physical block size. For RAID arrays it is often the stripe chunk size. A properly aligned multiple of minimum_io_size is the preferred request size for workloads where a high number of I/O operations is desired.
voidblk_limits_io_opt(struct queue_limits * limits, unsigned int opt)

set optimal request size for a device

Parameters

structqueue_limits*limits
the queue limits
unsignedintopt
optimal request size in bytes

Description

Storage devices may report an optimal I/O size, which is the device’s preferred unit for sustained I/O. This is rarely reported for disk drives. For RAID arrays it is usually the stripe width or the internal track size. A properly aligned multiple of optimal_io_size is the preferred request size for workloads where sustained throughput is desired.
voidblk_queue_io_opt(struct request_queue * q, unsigned int opt)

set optimal request size for the queue

Parameters

structrequest_queue*q
the request queue for the device
unsignedintopt
optimal request size in bytes

Description

Storage devices may report an optimal I/O size, which is the device’s preferred unit for sustained I/O. This is rarely reported for disk drives. For RAID arrays it is usually the stripe width or the internal track size. A properly aligned multiple of optimal_io_size is the preferred request size for workloads where sustained throughput is desired.
intblk_stack_limits(struct queue_limits * t, struct queue_limits * b, sector_t start)

adjust queue_limits for stacked devices

Parameters

structqueue_limits*t
the stacking driver limits (top device)
structqueue_limits*b
the underlying queue limits (bottom, component device)
sector_tstart
first data sector within component device

Description

This function is used by stacking drivers like MD and DM to ensure that all component devices have compatible block sizes and alignments. The stacking driver must provide a queue_limits struct (top) and then iteratively call the stacking function for all component (bottom) devices. The stacking function will attempt to combine the values and ensure proper alignment.

Returns 0 if the top and bottom queue_limits are compatible. The top device’s block sizes and alignment offsets may be adjusted to ensure alignment with the bottom device. If no compatible sizes and alignments exist, -1 is returned and the resulting top queue_limits will have the misaligned flag set to indicate that the alignment_offset is undefined.
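
A sketch of the iterative pattern described above (bottom_queues and nr are hypothetical):

#include <linux/blkdev.h>

static int stack_all_limits(struct queue_limits *top,
			    struct request_queue **bottom_queues, int nr)
{
	int i, ret = 0;

	blk_set_stacking_limits(top);	/* start from permissive defaults */
	for (i = 0; i < nr; i++)
		/* offset 0: component data starts at the device origin */
		ret |= blk_stack_limits(top, &bottom_queues[i]->limits, 0);

	return ret;	/* non-zero if some pairing was incompatible */
}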

voiddisk_stack_limits(struct gendisk * disk, struct block_device * bdev, sector_t offset)

adjust queue limits for stacked drivers

Parameters

structgendisk*disk
MD/DM gendisk (top)
structblock_device*bdev
the underlying block device (bottom)
sector_toffset
offset to beginning of data within component device

Description

Merges the limits for a top level gendisk and a bottom level block_device.
voidblk_queue_update_dma_pad(struct request_queue * q, unsigned int mask)

update pad mask

Parameters

structrequest_queue*q
the request queue for the device
unsignedintmask
pad mask

Description

Update dma pad mask.

Appending a pad buffer to a request modifies the last entry of a scatter list such that it includes the pad buffer.

voidblk_queue_segment_boundary(struct request_queue * q, unsigned long mask)

set boundary rules for segment merging

Parameters

structrequest_queue*q
the request queue for the device
unsignedlongmask
the memory boundary mask
voidblk_queue_virt_boundary(struct request_queue * q, unsigned long mask)

set boundary rules for bio merging

Parameters

structrequest_queue*q
the request queue for the device
unsignedlongmask
the memory boundary mask
voidblk_queue_dma_alignment(struct request_queue * q, int mask)

set dma length and memory alignment

Parameters

structrequest_queue*q
the request queue for the device
intmask
alignment mask

Description

Set required memory and length alignment for direct DMA transactions. This is used when building direct I/O requests for the queue.
voidblk_queue_update_dma_alignment(struct request_queue * q, int mask)

update dma length and memory alignment

Parameters

structrequest_queue*q
the request queue for the device
intmask
alignment mask

Description

Update required memory and length alignment for direct DMA transactions. If the requested alignment is larger than the current alignment, then the current queue alignment is updated to the new value, otherwise it is left alone. The design of this is to allow multiple objects (driver, device, transport etc) to set their respective alignments without having them interfere.
voidblk_set_queue_depth(struct request_queue * q, unsigned int depth)

tell the block layer about the device queue depth

Parameters

structrequest_queue*q
the request queue for the device
unsignedintdepth
queue depth
voidblk_queue_write_cache(struct request_queue * q, bool wc, bool fua)

configure queue’s write cache

Parameters

structrequest_queue*q
the request queue for the device
boolwc
write back cache on or off
boolfua
device supports FUA writes, if true

Description

Tell the block layer about the write cache of q.

voidblk_queue_required_elevator_features(struct request_queue * q, unsigned int features)

Set a queue required elevator features

Parameters

structrequest_queue*q
the request queue for the target device
unsignedintfeatures
Required elevator features OR’ed together

Description

Tell the block layer that for the device controlled through q, the only elevators that can be used are those that implement at least the set of features specified by features.

boolblk_queue_can_use_dma_map_merging(struct request_queue * q, struct device * dev)

configure queue for merging segments.

Parameters

structrequest_queue*q
the request queue for the device
structdevice*dev
the device pointer for dma

Description

Tell the block layer about merging the segments by dma map of q.

voidblk_queue_set_zoned(struct gendisk * disk, enum blk_zoned_model model)

configure a disk queue zoned model.

Parameters

structgendisk*disk
the gendisk of the queue to configure
enumblk_zoned_modelmodel
the zoned model to set

Description

Set the zoned model of the request queue of disk according to model. When model is BLK_ZONED_HM (host managed), this should be called only if zoned block device support is enabled (CONFIG_BLK_DEV_ZONED option). If model specifies BLK_ZONED_HA (host aware), the effective model used depends on CONFIG_BLK_DEV_ZONED settings and on the existence of partitions on the disk.

voidblk_execute_rq_nowait(struct request_queue * q, struct gendisk * bd_disk, struct request * rq, int at_head, rq_end_io_fn * done)

insert a request into queue for execution

Parameters

structrequest_queue*q
queue to insert the request in
structgendisk*bd_disk
matching gendisk
structrequest*rq
request to insert
intat_head
insert request at head or tail of queue
rq_end_io_fn*done
I/O completion handler

Description

Insert a fully prepared request at the back of the I/O scheduler queue for execution. Don’t wait for completion.

Note

This function will invoke done directly if the queue is dead.
voidblk_execute_rq(struct request_queue * q, struct gendisk * bd_disk, struct request * rq, int at_head)

insert a request into queue for execution

Parameters

structrequest_queue*q
queue to insert the request in
structgendisk*bd_disk
matching gendisk
structrequest*rq
request to insert
intat_head
insert request at head or tail of queue

Description

Insert a fully prepared request at the back of the I/O scheduler queue for execution and wait for completion.
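
A sketch of a synchronous passthrough command carrying a kernel buffer, combining blk_get_request(), blk_rq_map_kern() and blk_execute_rq(); interpreting the completion result is driver specific and elided here:

#include <linux/blkdev.h>

static int issue_drv_cmd(struct request_queue *q, struct gendisk *disk,
			 void *buf, unsigned int len)
{
	struct request *rq;
	int ret;

	rq = blk_get_request(q, REQ_OP_DRV_IN, 0);
	if (IS_ERR(rq))
		return PTR_ERR(rq);

	ret = blk_rq_map_kern(q, rq, buf, len, GFP_KERNEL);
	if (!ret)
		blk_execute_rq(q, disk, rq, 0);	/* waits for completion */

	blk_put_request(rq);
	return ret;
}
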
intblkdev_issue_flush(struct block_device * bdev, gfp_t gfp_mask)

queue a flush

Parameters

structblock_device*bdev
blockdev to issue flush for
gfp_tgfp_mask
memory allocation flags (for bio_alloc)

Description

Issue a flush for the block device in question.
intblkdev_issue_discard(struct block_device * bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask, unsigned long flags)

queue a discard

Parameters

structblock_device*bdev
blockdev to issue discard for
sector_tsector
start sector
sector_tnr_sects
number of sectors to discard
gfp_tgfp_mask
memory allocation flags (for bio_alloc)
unsignedlongflags
BLKDEV_DISCARD_* flags to control behaviour

Description

Issue a discard request for the sectors in question.
intblkdev_issue_write_same(struct block_device * bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask, struct page * page)

queue a write same operation

Parameters

structblock_device*bdev
target blockdev
sector_tsector
start sector
sector_tnr_sects
number of sectors to write
gfp_tgfp_mask
memory allocation flags (for bio_alloc)
structpage*page
page containing data

Description

Issue a write same request for the sectors in question.
int__blkdev_issue_zeroout(struct block_device * bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask, struct bio ** biop, unsigned flags)

generate a number of zero-filled write bios

Parameters

structblock_device*bdev
blockdev to issue
sector_tsector
start sector
sector_tnr_sects
number of sectors to write
gfp_tgfp_mask
memory allocation flags (for bio_alloc)
structbio**biop
pointer to anchor bio
unsignedflags
controls detailed behavior

Description

Zero-fill a block range, either using hardware offload or by explicitly writing zeroes to the device.

If a device is using logical block provisioning, the underlying space will not be released if flags contains BLKDEV_ZERO_NOUNMAP.

If flags contains BLKDEV_ZERO_NOFALLBACK, the function will return -EOPNOTSUPP if no explicit hardware offload for zeroing is provided.

intblkdev_issue_zeroout(struct block_device * bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask, unsigned flags)

zero-fill a block range

Parameters

structblock_device*bdev
blockdev to write
sector_tsector
start sector
sector_tnr_sects
number of sectors to write
gfp_tgfp_mask
memory allocation flags (for bio_alloc)
unsignedflags
controls detailed behavior

Description

Zero-fill a block range, either using hardware offload or by explicitly writing zeroes to the device. See __blkdev_issue_zeroout() for the valid values for flags.
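
For instance, a sketch that zeroes the first 1 MiB of a device (2048 sectors of 512 bytes), letting the block layer pick offload or explicit writes:

#include <linux/blkdev.h>

static int zero_first_mib(struct block_device *bdev)
{
	return blkdev_issue_zeroout(bdev, 0, 2048, GFP_KERNEL, 0);
}
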
intblk_rq_count_integrity_sg(struct request_queue * q, struct bio * bio)

Count number of integrity scatterlist elements

Parameters

structrequest_queue*q
request queue
structbio*bio
bio with integrity metadata attached

Description

Returns the number of elements required in a scatterlist corresponding to the integrity metadata in a bio.

intblk_rq_map_integrity_sg(struct request_queue * q, struct bio * bio, struct scatterlist * sglist)

Map integrity metadata into a scatterlist

Parameters

structrequest_queue*q
request queue
structbio*bio
bio with integrity metadata attached
structscatterlist*sglist
target scatterlist

Description

Map the integrity vectors in request into a scatterlist. The scatterlist must be big enough to hold all elements, i.e. sized using blk_rq_count_integrity_sg().

intblk_integrity_compare(struct gendisk * gd1, struct gendisk * gd2)

Compare integrity profile of two disks

Parameters

structgendisk*gd1
Disk to compare
structgendisk*gd2
Disk to compare

Description

Meta-devices like DM and MD need to verify that all sub-devices use the same integrity format before advertising to upper layers that they can send/receive integrity metadata. This function can be used to check whether two gendisk devices have compatible integrity formats.

voidblk_integrity_register(struct gendisk * disk, struct blk_integrity * template)

Register a gendisk as being integrity-capable

Parameters

structgendisk*disk
struct gendisk pointer to make integrity-aware
structblk_integrity*template
block integrity profile to register

Description

When a device needs to advertise itself as being able to send/receive integrity metadata it must use this function to register the capability with the block layer. The template is a blk_integrity struct with values appropriate for the underlying hardware. See Documentation/block/data-integrity.rst.

voidblk_integrity_unregister(struct gendisk * disk)

Unregister block integrity profile

Parameters

structgendisk*disk
disk whose integrity profile to unregister

Description

This function unregisters the integrity capability from a block device.

intblk_trace_ioctl(struct block_device * bdev, unsigned cmd, char __user * arg)

handle the ioctls associated with tracing

Parameters

structblock_device*bdev
the block device
unsignedcmd
the ioctl cmd
char__user*arg
the argument data, if any
voidblk_trace_shutdown(struct request_queue * q)

stop and cleanup trace structures

Parameters

structrequest_queue*q
the request queue associated with the device
voidblk_add_trace_rq(struct request * rq, int error, unsigned int nr_bytes, u32 what, u64 cgid)

Add a trace for a request oriented action

Parameters

structrequest*rq
the source request
interror
return status to log
unsignedintnr_bytes
number of completed bytes
u32what
the action
u64cgid
the cgroup info

Description

Records an action against a request. Will log the bio offset + size.
voidblk_add_trace_bio(struct request_queue * q, struct bio * bio, u32 what, int error)

Add a trace for a bio oriented action

Parameters

structrequest_queue*q
queue the io is for
structbio*bio
the source bio
u32what
the action
interror
error, if any

Description

Records an action against a bio. Will log the bio offset + size.
voidblk_add_trace_bio_remap(void * ignore, struct request_queue * q, struct bio * bio, dev_t dev, sector_t from)

Add a trace for a bio-remap operation

Parameters

void*ignore
trace callback data parameter (not used)
structrequest_queue*q
queue the io is for
structbio*bio
the source bio
dev_tdev
target device
sector_tfrom
source sector

Description

Device mapper or raid targets sometimes need to split a bio because it spans a stripe (or similar). Add a trace for that action.
voidblk_add_trace_rq_remap(void * ignore, struct request_queue * q, struct request * rq, dev_t dev, sector_t from)

Add a trace for a request-remap operation

Parameters

void*ignore
trace callback data parameter (not used)
structrequest_queue*q
queue the io is for
structrequest*rq
the source request
dev_tdev
target device
sector_tfrom
source sector

Description

Device mapper remaps requests to other devices. Add a trace for that action.
struct hd_struct *disk_get_part(struct gendisk * disk, int partno)

get partition

Parameters

structgendisk*disk
disk to look up the partition from
intpartno
partition number

Description

Look for partition partno from disk. If found, increment the reference count and return it.

Context

Don’t care.

Return

Pointer to the found partition on success, NULL if not found.

struct hd_struct *disk_map_sector_rcu(struct gendisk * disk, sector_t sector)

map sector to partition

Parameters

structgendisk*disk
gendisk of interest
sector_tsector
sector to map

Description

Find out which partition sector maps to on disk. This is primarily used for stats accounting.

Context

RCU read locked. The returned partition pointer is always valid because its refcount is grabbed except for part0, whose lifetime is the same as the disk’s.

Return

Found partition on success, part0 is returned if no partition matches or the matched partition is being deleted.

intblk_mangle_minor(int minor)

scatter minor numbers apart

Parameters

intminor
minor number to mangle

Description

Scatter consecutively allocated minor numbers apart if MANGLE_DEVT is enabled. Mangling twice gives the original value.

Return

Mangled value.

Context

Don’t care.

intblk_alloc_devt(struct hd_struct * part, dev_t * devt)

allocate a dev_t for a partition

Parameters

structhd_struct*part
partition to allocate dev_t for
dev_t*devt
out parameter for resulting dev_t

Description

Allocate a dev_t for block device.

Return

0 on success, allocated dev_t is returned in *devt. -errno on failure.

Context

Might sleep.

voidblk_free_devt(dev_t devt)

free a dev_t

Parameters

dev_tdevt
dev_t to free

Description

Free devt which was allocated using blk_alloc_devt().

Context

Might sleep.

void__device_add_disk(struct device * parent, struct gendisk * disk, const struct attribute_group ** groups, bool register_queue)

add disk information to kernel list

Parameters

structdevice*parent
parent device for the disk
structgendisk*disk
per-device partitioning information
conststructattribute_group**groups
Additional per-device sysfs groups
boolregister_queue
register the queue if set to true

Description

This function registers the partitioning information in disk with the kernel.

FIXME: error handling

struct gendisk *get_gendisk(dev_t devt, int * partno)

get partitioning information for a given device

Parameters

dev_tdevt
device to get partitioning information for
int*partno
returned partition index

Description

This function gets the structure containing partitioning information for the given device devt.

Context

can sleep

voiddisk_replace_part_tbl(struct gendisk * disk, struct disk_part_tbl * new_ptbl)

replace disk->part_tbl in RCU-safe way

Parameters

structgendisk*disk
disk to replace part_tbl for
structdisk_part_tbl*new_ptbl
new part_tbl to install

Description

Replace disk->part_tbl with new_ptbl in RCU-safe way. The original ptbl is freed using RCU callback.

LOCKING: Matching bd_mutex locked or the caller is the only user of disk.

intdisk_expand_part_tbl(struct gendisk * disk, int partno)

expand disk->part_tbl

Parameters

structgendisk*disk
disk to expand part_tbl for
intpartno
expand such that this partno can fit in

Description

Expand disk->part_tbl such that partno can fit in. disk->part_tbl uses RCU to allow unlocked dereferencing for stats and other stuff.

LOCKING: Matching bd_mutex locked or the caller is the only user of disk. Might sleep.

Return

0 on success, -errno on failure.

voiddisk_release(struct device * dev)

releases all allocated resources of the gendisk

Parameters

structdevice*dev
the device representing this disk

Description

This function releases all allocated resources of the gendisk.

The struct gendisk refcount is incremented with get_gendisk() or get_disk_and_module(), and its refcount is decremented with put_disk_and_module() or put_disk(). Once the refcount reaches 0 this function is called.

Drivers which used __device_add_disk() have a gendisk with a request_queue assigned. Since the request_queue sits on top of the gendisk for these drivers we also call blk_put_queue() for them, and we expect the request_queue refcount to reach 0 at this point, and so the request_queue will also be freed prior to the disk.

Context

can sleep

voiddisk_block_events(struct gendisk * disk)

block and flush disk event checking

Parameters

structgendisk*disk
disk to block events for

Description

On return from this function, it is guaranteed that event checking isn’t in progress and won’t happen until unblocked by disk_unblock_events(). Events blocking is counted and the actual unblocking happens after the matching number of unblocks are done.

Note that this intentionally does not block event checking from disk_clear_events().

Context

Might sleep.

voiddisk_unblock_events(struct gendisk * disk)

unblock disk event checking

Parameters

structgendisk*disk
disk to unblock events for

Description

Undo disk_block_events(). When the block count reaches zero, it starts events polling if configured.

Context

Don’t care. Safe to call from irq context.

voiddisk_flush_events(struct gendisk * disk, unsigned int mask)

schedule immediate event checking and flushing

Parameters

structgendisk*disk
disk to check and flush events for
unsignedintmask
events to flush

Description

Schedule immediate event checking on disk if not blocked. Events in mask are scheduled to be cleared from the driver. Note that this doesn’t clear the events from disk->ev.

Context

If mask is non-zero, must be called with bdev->bd_mutex held.

unsigned intdisk_clear_events(struct gendisk * disk, unsigned int mask)

synchronously check, clear and return pending events

Parameters

structgendisk*disk
disk to fetch and clear events from
unsignedintmask
mask of events to be fetched and cleared

Description

Disk events are synchronously checked and pending events in mask are cleared and returned. This ignores the block count.

Context

Might sleep.

voiddisk_part_iter_init(struct disk_part_iter * piter, struct gendisk * disk, unsigned int flags)

initialize partition iterator

Parameters

structdisk_part_iter*piter
iterator to initialize
structgendisk*disk
disk to iterate over
unsignedintflags
DISK_PITER_* flags

Description

Initialize piter so that it iterates over partitions of disk.

Context

Don’t care.

struct hd_struct *disk_part_iter_next(struct disk_part_iter * piter)

proceed iterator to the next partition and return it

Parameters

structdisk_part_iter*piter
iterator of interest

Description

Proceed piter to the next partition and return it.

Context

Don’t care.

voiddisk_part_iter_exit(struct disk_part_iter * piter)

finish up partition iteration

Parameters

structdisk_part_iter*piter
iter of interest

Description

Called when iteration is over. Cleans up piter.

Context

Don’t care.
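
A sketch of the full iterator pattern (list_partitions is a hypothetical helper):

#include <linux/genhd.h>

static void list_partitions(struct gendisk *disk)
{
	struct disk_part_iter piter;
	struct hd_struct *part;

	disk_part_iter_init(&piter, disk, DISK_PITER_INCL_PART0);
	while ((part = disk_part_iter_next(&piter)))
		pr_info("partition %d: %llu sectors\n", part->partno,
			(unsigned long long)part->nr_sects);
	disk_part_iter_exit(&piter);
}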

booldisk_has_partitions(struct gendisk * disk)

Parameters

structgendisk*disk
gendisk of interest

Description

Walk through the partition table and check if a valid partition exists.

Context

Don’t care.

Return

True if the gendisk has at least one valid non-zero size partition. Otherwise false.

intregister_blkdev(unsigned int major, const char * name)

register a new block device

Parameters

unsignedintmajor
the requested major device number [1..BLKDEV_MAJOR_MAX-1]. If major = 0, try to allocate any unused major number.
constchar*name
the name of the new block device as a zero terminated string

Description

The name must be unique within the system.

The return value depends on the major input parameter:

  • if a major device number was requested in range [1..BLKDEV_MAJOR_MAX-1] then the function returns zero on success, or a negative error code
  • if any unused major number was requested with the major = 0 parameter then the return value is the allocated major number in range [1..BLKDEV_MAJOR_MAX-1] or a negative error code otherwise

See Documentation/admin-guide/devices.txt for the list of allocated major numbers.
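
A sketch of dynamic major allocation ("mydrv" is a hypothetical name):

#include <linux/genhd.h>

static int major;

static int mydrv_register(void)
{
	major = register_blkdev(0, "mydrv");	/* 0: pick any free major */
	if (major < 0)
		return major;
	pr_info("mydrv: got major %d\n", major);
	return 0;
}

static void mydrv_unregister(void)
{
	unregister_blkdev(major, "mydrv");
}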

voiddel_gendisk(struct gendisk * disk)

remove the gendisk

Parameters

structgendisk*disk
the struct gendisk to remove

Description

Removes the gendisk and all its associated resources. This deletes the partitions associated with the gendisk, and unregisters the associated request_queue.

This is the counterpart to the respective __device_add_disk() call.

The final removal of the struct gendisk happens when its refcount reaches 0 with put_disk(), which should be called after del_gendisk(), if __device_add_disk() was used.

Drivers exist which depend on the release of the gendisk to be synchronous; it should not be deferred.

Context

can sleep

struct block_device *bdget_disk(struct gendisk * disk, int partno)

do bdget() by gendisk and partition number

Parameters

structgendisk*disk
gendisk of interest
intpartno
partition number

Description

Find partition partno from disk, do bdget() on it.

Context

Don’t care.

Return

Resulting block_device on success, NULL on failure.

struct kobject *get_disk_and_module(struct gendisk * disk)

increments the gendisk and gendisk fops module refcount

Parameters

structgendisk*disk
the struct gendisk to increment the refcount for

Description

This increments the refcount for the struct gendisk, and the gendisk’s fops module owner.

Context

Any context.

voidput_disk(struct gendisk * disk)

decrements the gendisk refcount

Parameters

structgendisk*disk
the struct gendisk to decrement the refcount for

Description

This decrements the refcount for the struct gendisk. When this reaches 0 we’ll have disk_release() called.

Context

Any context, but the last reference must not be dropped from atomic context.

voidput_disk_and_module(struct gendisk * disk)

decrements the module and gendisk refcount

Parameters

structgendisk*disk
the struct gendisk to decrement the refcount for

Description

This is a counterpart of get_disk_and_module() and thus also of get_gendisk().

Context

Any context, but the last reference must not be dropped from atomic context.

Char devices

intregister_chrdev_region(dev_t from, unsigned count, const char * name)

register a range of device numbers

Parameters

dev_tfrom
the first in the desired range of device numbers; must include the major number.
unsignedcount
the number of consecutive device numbers required
constchar*name
the name of the device or driver.

Description

Return value is zero on success, a negative error code on failure.

intalloc_chrdev_region(dev_t * dev, unsigned baseminor, unsigned count, const char * name)

register a range of char device numbers

Parameters

dev_t*dev
output parameter for first assigned number
unsignedbaseminor
first of the requested range of minor numbers
unsignedcount
the number of minor numbers required
constchar*name
the name of the associated device or driver

Description

Allocates a range of char device numbers. The major number will be chosen dynamically, and returned (along with the first minor number) in dev. Returns zero or a negative error code.
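
A sketch of reserving and releasing one dynamically numbered device ("mydev" is a hypothetical name):

#include <linux/fs.h>

static dev_t devt;

static int reserve_numbers(void)
{
	int ret = alloc_chrdev_region(&devt, 0, 1, "mydev");

	if (ret)
		return ret;
	pr_info("mydev: major %d minor %d\n", MAJOR(devt), MINOR(devt));
	return 0;
}

static void release_numbers(void)
{
	unregister_chrdev_region(devt, 1);
}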

int__register_chrdev(unsigned int major, unsigned int baseminor, unsigned int count, const char * name, const struct file_operations * fops)

create and register a cdev occupying a range of minors

Parameters

unsignedintmajor
major device number or 0 for dynamic allocation
unsignedintbaseminor
first of the requested range of minor numbers
unsignedintcount
the number of minor numbers required
constchar*name
name of this range of devices
conststructfile_operations*fops
file operations associated with these devices

Description

If major == 0 this function will dynamically allocate a major and return its number.

If major > 0 this function will attempt to reserve a device with the given major number and will return zero on success.

Returns a negative errno on failure.

The name of this device has nothing to do with the name of the device in /dev. It only helps to keep track of the different owners of devices. If your module registers only one type of device it’s ok to use e.g. the name of the module here.

voidunregister_chrdev_region(dev_t from, unsigned count)

unregister a range of device numbers

Parameters

dev_tfrom
the first in the range of numbers to unregister
unsignedcount
the number of device numbers to unregister

Description

This function will unregister a range of count device numbers, starting with from. The caller should normally be the one who allocated those numbers in the first place…

void__unregister_chrdev(unsigned int major, unsigned int baseminor, unsigned int count, const char * name)

unregister and destroy a cdev

Parameters

unsignedintmajor
major device number
unsignedintbaseminor
first of the range of minor numbers
unsignedintcount
the number of minor numbers this cdev is occupying
constchar*name
name of this range of devices

Description

Unregister and destroy the cdev occupying the region described by major, baseminor and count. This function undoes what __register_chrdev() did.

intcdev_add(struct cdev * p, dev_t dev, unsigned count)

add a char device to the system

Parameters

structcdev*p
the cdev structure for the device
dev_tdev
the first device number for which this device is responsible
unsignedcount
the number of consecutive minor numbers corresponding to this device

Description

cdev_add() adds the device represented by p to the system, making it live immediately. A negative error code is returned on failure.

voidcdev_set_parent(struct cdev * p, struct kobject * kobj)

set the parent kobject for a char device

Parameters

structcdev*p
the cdev structure
structkobject*kobj
the kobject to take a reference to

Description

cdev_set_parent() sets a parent kobject which will be referenced appropriately so the parent is not freed before the cdev. This should be called before cdev_add.

intcdev_device_add(struct cdev * cdev, structdevice * dev)

add a char device and its corresponding struct device, linking them together

Parameters

structcdev*cdev
the cdev structure
structdevice*dev
the device structure

Description

cdev_device_add() adds the char device represented by cdev to the system, just as cdev_add() does. It then adds dev to the system using device_add(). The dev_t for the char device will be taken from the struct device, which needs to be initialized first. This helper function correctly takes a reference to the parent device so the parent will not get released until all references to the cdev are released.

This helper uses dev->devt for the device number. If it is not set, it will not add the cdev and it will be equivalent to device_add().

This function should be used whenever the struct cdev and thestruct device are members of the same structure whose lifetime ismanaged by the struct device.

NOTE

Callers must assume that userspace was able to open the cdev and can call cdev fops callbacks at any time, even if this function fails.
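A sketch of the embedded-structure pattern this function is meant for; the struct name, function names, and fields filled in here are hypothetical, and a real driver must also set a release callback and the usual device attributes:

#include <linux/cdev.h>
#include <linux/device.h>
#include <linux/module.h>

struct my_dev {                         /* cdev and device share one lifetime */
        struct device dev;
        struct cdev cdev;
};

static int my_dev_add(struct my_dev *md, dev_t devno,
                      const struct file_operations *fops)
{
        device_initialize(&md->dev);    /* refcounting covers the whole struct */
        md->dev.devt = devno;           /* cdev_device_add() reads dev->devt */
        /* a real driver would also set md->dev.release, parent, name, ... */

        cdev_init(&md->cdev, fops);
        md->cdev.owner = THIS_MODULE;

        /* Adds the cdev and then the device; the cdev takes a reference
         * on the device, so the structure outlives any open file. */
        return cdev_device_add(&md->cdev, &md->dev);
}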

voidcdev_device_del(struct cdev * cdev, structdevice * dev)

inverse of cdev_device_add

Parameters

structcdev*cdev
the cdev structure
structdevice*dev
the device structure

Description

cdev_device_del() is a helper function to call cdev_del() and device_del(). It should be used whenever cdev_device_add() is used.

If dev->devt is not set it will not remove the cdev and will be equivalent to device_del().

NOTE

This guarantees that associated sysfs callbacks are not running or runnable, however any cdevs already open will remain and their fops will still be callable even after this function returns.

voidcdev_del(struct cdev * p)

remove a cdev from the system

Parameters

structcdev*p
the cdev structure to be removed

Description

cdev_del() removes p from the system, possibly freeing the structure itself.

NOTE

This guarantees that the cdev device will no longer be able to be opened, however any cdevs already open will remain and their fops will still be callable even after cdev_del returns.

struct cdev *cdev_alloc(void)

allocate a cdev structure

Parameters

void
no arguments

Description

Allocates and returns a cdev structure, or NULL on failure.

voidcdev_init(struct cdev * cdev, const struct file_operations * fops)

initialize a cdev structure

Parameters

structcdev*cdev
the structure to initialize
conststructfile_operations*fops
the file_operations for this device

Description

Initializes cdev, remembering fops, making it ready to add to the system with cdev_add().
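Putting the pieces above together, a hedged end-to-end sketch of a char device's lifetime; names are invented and error handling is abbreviated:

#include <linux/cdev.h>
#include <linux/fs.h>
#include <linux/module.h>

static dev_t devno;
static struct cdev my_cdev;

static const struct file_operations my_fops = {
        .owner = THIS_MODULE,
        /* .open, .read, .write, ... */
};

static int __init my_init(void)
{
        int ret = alloc_chrdev_region(&devno, 0, 1, "my_driver");

        if (ret < 0)
                return ret;

        cdev_init(&my_cdev, &my_fops);  /* bind fops to the cdev */
        my_cdev.owner = THIS_MODULE;

        /* The device is live (openable) as soon as cdev_add() returns. */
        ret = cdev_add(&my_cdev, devno, 1);
        if (ret)
                unregister_chrdev_region(devno, 1);
        return ret;
}

static void __exit my_exit(void)
{
        cdev_del(&my_cdev);             /* no new opens after this */
        unregister_chrdev_region(devno, 1);
}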

Clock Framework

The clock framework defines programming interfaces to support software management of the system clock tree. This framework is widely used with System-On-Chip (SOC) platforms to support power management and various devices which may need custom clock rates. Note that these "clocks" don't relate to timekeeping or real time clocks (RTCs), each of which has a separate framework. These struct clk instances may be used to manage for example a 96 MHz signal that is used to shift bits into and out of peripherals or busses, or otherwise trigger synchronous state machine transitions in system hardware.

Power management is supported by explicit software clock gating: unused clocks are disabled, so the system doesn't waste power changing the state of transistors that aren't in active use. On some systems this may be backed by hardware clock gating, where clocks are gated without being disabled in software. Sections of chips that are powered but not clocked may be able to retain their last state. This low power state is often called a retention mode. This mode still incurs leakage currents, especially with finer circuit geometries, but for CMOS circuits power is mostly used by clocked state changes.

Power-aware drivers only enable their clocks when the device they manage is in active use. Also, system sleep states often differ according to which clock domains are active: while a "standby" state may allow wakeup from several active domains, a "mem" (suspend-to-RAM) state may require a more wholesale shutdown of clocks derived from higher speed PLLs and oscillators, limiting the number of possible wakeup event sources. A driver's suspend method may need to be aware of system-specific clock constraints on the target sleep state.

Some platforms support programmable clock generators. These can be used by external chips of various kinds, such as other CPUs, multimedia codecs, and devices with strict requirements for interface clocking.

structclk_notifier

associate a clk with a notifier

Definition

struct clk_notifier {
        struct clk                      *clk;
        struct srcu_notifier_head       notifier_head;
        struct list_head                node;
};

Members

clk
struct clk * to associate the notifier with
notifier_head
an srcu_notifier_head for this clk
node
linked list pointers

Description

A list of struct clk_notifier is maintained by the notifier code. An entry is created whenever code registers the first notifier on a particular clk. Future notifiers on that clk are added to the notifier_head.

structclk_notifier_data

rate data to pass to the notifier callback

Definition

struct clk_notifier_data {
        struct clk              *clk;
        unsigned long           old_rate;
        unsigned long           new_rate;
};

Members

clk
struct clk * being changed
old_rate
previous rate of this clk
new_rate
new rate of this clk

Description

For a pre-notifier, old_rate is the clk's rate before this rate change, and new_rate is what the rate will be in the future. For a post-notifier, old_rate and new_rate are both set to the clk's current rate (this was done to optimize the implementation).

structclk_bulk_data

Data used for bulk clk operations.

Definition

struct clk_bulk_data {
        const char              *id;
        struct clk              *clk;
};

Members

id
clock consumer ID
clk
struct clk * to store the associated clock

Description

The CLK APIs provide a series of clk_bulk_() API calls as a convenience to consumers which require multiple clks. This structure is used to manage data for these calls.

intclk_notifier_register(struct clk * clk, struct notifier_block * nb)

change notifier callback

Parameters

structclk*clk
clock whose rate we are interested in
structnotifier_block*nb
notifier block with callback function pointer

Description

ProTip: debugging across notifier chains can be frustrating. Make sure that your notifier callback function prints a nice big warning in case of failure.
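A sketch of a rate-change notifier; the callback name and the 200 MHz policy are invented, and per the ProTip above it complains loudly before vetoing:

#include <linux/clk.h>
#include <linux/notifier.h>
#include <linux/printk.h>

static int my_rate_notifier(struct notifier_block *nb,
                            unsigned long event, void *data)
{
        struct clk_notifier_data *ndata = data;

        /* PRE_RATE_CHANGE may veto the change; POST_RATE_CHANGE and
         * ABORT_RATE_CHANGE only observe the outcome. */
        if (event == PRE_RATE_CHANGE && ndata->new_rate > 200000000UL) {
                pr_warn("my_driver: refusing rate change %lu -> %lu\n",
                        ndata->old_rate, ndata->new_rate);
                return notifier_from_errno(-EINVAL);
        }
        return NOTIFY_OK;
}

static struct notifier_block my_nb = {
        .notifier_call = my_rate_notifier,
};

/* ... later, in process context: clk_notifier_register(clk, &my_nb); ... */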

intclk_notifier_unregister(struct clk * clk, struct notifier_block * nb)

change notifier callback

Parameters

structclk*clk
clock whose rate we are no longer interested in
structnotifier_block*nb
notifier block which will be unregistered
longclk_get_accuracy(struct clk * clk)

obtain the clock accuracy in ppb (parts per billion) for a clock source.

Parameters

structclk*clk
clock source

Description

This gets the clock source accuracy expressed in ppb. A perfect clock returns 0.

intclk_set_phase(struct clk * clk, int degrees)

adjust the phase shift of a clock signal

Parameters

structclk*clk
clock signal source
intdegrees
number of degrees the signal is shifted

Description

Shifts the phase of a clock signal by the specified degrees. Returns 0 on success, a negative errno otherwise.

intclk_get_phase(struct clk * clk)

return the phase shift of a clock signal

Parameters

structclk*clk
clock signal source

Description

Returns the phase shift of a clock node in degrees, otherwise returns a negative errno.

intclk_set_duty_cycle(struct clk * clk, unsigned int num, unsigned int den)

adjust the duty cycle ratio of a clock signal

Parameters

structclk*clk
clock signal source
unsignedintnum
numerator of the duty cycle ratio to be applied
unsignedintden
denominator of the duty cycle ratio to be applied

Description

Adjust the duty cycle of a clock signal by the specified ratio. Returns 0 on success, a negative errno otherwise.

intclk_get_scaled_duty_cycle(struct clk * clk, unsigned int scale)

return the duty cycle ratio of a clock signal

Parameters

structclk*clk
clock signal source
unsignedintscale
scaling factor to be applied to represent the ratio as an integer

Description

Returns the duty cycle ratio multiplied by the scale provided, otherwise returns a negative errno.

boolclk_is_match(const struct clk * p, const struct clk * q)

check if two clk’s point to the same hardware clock

Parameters

conststructclk*p
clk compared against q
conststructclk*q
clk compared against p

Description

Returns true if the two struct clk pointers both point to the same hardware clock node. Put differently, returns true if p and q share the same struct clk_core object.

Returns false otherwise. Note that two NULL clks are treated as matching.

intclk_prepare(struct clk * clk)

prepare a clock source

Parameters

structclk*clk
clock source

Description

This prepares the clock source for use.

Must not be called from within atomic context.

voidclk_unprepare(struct clk * clk)

undo preparation of a clock source

Parameters

structclk*clk
clock source

Description

This undoes a previously prepared clock. The caller must balancethe number of prepare and unprepare calls.

Must not be called from within atomic context.

struct clk *clk_get(structdevice * dev, const char * id)

lookup and obtain a reference to a clock producer.

Parameters

structdevice*dev
device for clock “consumer”
constchar*id
clock consumer ID

Description

Returns a struct clk corresponding to the clock producer, or valid IS_ERR() condition containing errno. The implementation uses dev and id to determine the clock consumer, and thereby the clock producer. (IOW, id may be identical strings, but clk_get may return different clock producers depending on dev.)

Drivers must assume that the clock source is not enabled.

clk_get should not be called from within interrupt context.
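A sketch of the canonical consumer sequence; the "bus" consumer ID and the function name are hypothetical. clk_prepare() may sleep, so it runs in process context, while clk_enable() is the atomic-safe half:

#include <linux/clk.h>
#include <linux/device.h>
#include <linux/err.h>

static int my_clk_start(struct device *dev)
{
        struct clk *clk;
        int ret;

        clk = clk_get(dev, "bus");      /* "bus" is a made-up consumer ID */
        if (IS_ERR(clk))
                return PTR_ERR(clk);

        ret = clk_prepare(clk);         /* may sleep */
        if (ret)
                goto err_put;

        ret = clk_enable(clk);          /* atomic-safe counterpart */
        if (ret)
                goto err_unprepare;

        dev_info(dev, "clock running at %lu Hz\n", clk_get_rate(clk));
        return 0;

err_unprepare:
        clk_unprepare(clk);
err_put:
        clk_put(clk);
        return ret;
}

Teardown mirrors this in reverse: clk_disable(), clk_unprepare(), clk_put().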

intclk_bulk_get(structdevice * dev, int num_clks, structclk_bulk_data * clks)

lookup and obtain a number of references to clock producer.

Parameters

structdevice*dev
device for clock “consumer”
intnum_clks
the number of clk_bulk_data
structclk_bulk_data*clks
the clk_bulk_data table of consumer

Description

This helper function allows drivers to get several clk consumers in one operation. If any of the clks cannot be acquired then any clks that were obtained will be freed before returning to the caller.

Returns 0 if all clocks specified in the clk_bulk_data table are obtained successfully, or valid IS_ERR() condition containing errno. The implementation uses dev and clk_bulk_data.id to determine the clock consumer, and thereby the clock producer. The clock returned is stored in each clk_bulk_data.clk field.

Drivers must assume that the clock source is not enabled.

clk_bulk_get should not be called from within interrupt context.
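A sketch of bulk acquisition for two hypothetical consumer IDs, unwinding in reverse order on failure; clk_bulk_prepare()/clk_bulk_unprepare() are the bulk counterparts of clk_prepare()/clk_unprepare(), alongside the enable/disable calls documented later in this section:

#include <linux/clk.h>

static struct clk_bulk_data my_clks[] = {
        { .id = "core" },               /* made-up consumer IDs */
        { .id = "bus"  },
};

static int my_clks_on(struct device *dev)
{
        int ret;

        ret = clk_bulk_get(dev, ARRAY_SIZE(my_clks), my_clks);
        if (ret)
                return ret;

        ret = clk_bulk_prepare(ARRAY_SIZE(my_clks), my_clks);
        if (ret)
                goto err_put;

        ret = clk_bulk_enable(ARRAY_SIZE(my_clks), my_clks);
        if (ret)
                goto err_unprepare;

        return 0;

err_unprepare:
        clk_bulk_unprepare(ARRAY_SIZE(my_clks), my_clks);
err_put:
        clk_bulk_put(ARRAY_SIZE(my_clks), my_clks);
        return ret;
}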

intclk_bulk_get_all(structdevice * dev, structclk_bulk_data ** clks)

lookup and obtain all available references to clock producer.

Parameters

structdevice*dev
device for clock “consumer”
structclk_bulk_data**clks
pointer to the clk_bulk_data table of consumer

Description

This helper function allows drivers to get all clk consumers in one operation. If any of the clks cannot be acquired then any clks that were obtained will be freed before returning to the caller.

Returns a positive value for the number of clocks obtained, while the clock references are stored in the clk_bulk_data table in the clks field. Returns 0 if there are none and a negative value if something failed.

Drivers must assume that the clock source is not enabled.

clk_bulk_get_all should not be called from within interrupt context.

intclk_bulk_get_optional(structdevice * dev, int num_clks, structclk_bulk_data * clks)

lookup and obtain a number of references to clock producer

Parameters

structdevice*dev
device for clock “consumer”
intnum_clks
the number of clk_bulk_data
structclk_bulk_data*clks
the clk_bulk_data table of consumer

Description

Behaves the same as clk_bulk_get() except where there is no clock producer. In this case, instead of returning -ENOENT, the function returns 0 and NULL for a clk for which a clock producer could not be determined.

intdevm_clk_bulk_get(structdevice * dev, int num_clks, structclk_bulk_data * clks)

managed get multiple clk consumers

Parameters

structdevice*dev
device for clock “consumer”
intnum_clks
the number of clk_bulk_data
structclk_bulk_data*clks
the clk_bulk_data table of consumer

Description

Return 0 on success, an errno on failure.

This helper function allows drivers to get several clk consumers in one operation with management; the clks will automatically be freed when the device is unbound.

intdevm_clk_bulk_get_optional(structdevice * dev, int num_clks, structclk_bulk_data * clks)

managed get multiple optional consumer clocks

Parameters

structdevice*dev
device for clock “consumer”
intnum_clks
the number of clk_bulk_data
structclk_bulk_data*clks
pointer to the clk_bulk_data table of consumer

Description

Behaves the same as devm_clk_bulk_get() except where there is no clock producer. In this case, instead of returning -ENOENT, the function returns NULL for the given clk. It is assumed all clocks in clk_bulk_data are optional.

Returns 0 if all clocks specified in the clk_bulk_data table are obtained successfully, or if for any clk there was no clk provider available; otherwise returns a valid IS_ERR() condition containing errno. The implementation uses dev and clk_bulk_data.id to determine the clock consumer, and thereby the clock producer. The clock returned is stored in each clk_bulk_data.clk field.

Drivers must assume that the clock source is not enabled.

devm_clk_bulk_get_optional should not be called from within interrupt context.

intdevm_clk_bulk_get_all(structdevice * dev, structclk_bulk_data ** clks)

managed get multiple clk consumers

Parameters

structdevice*dev
device for clock “consumer”
structclk_bulk_data**clks
pointer to the clk_bulk_data table of consumer

Description

Returns a positive value for the number of clocks obtained, while the clock references are stored in the clk_bulk_data table in the clks field. Returns 0 if there are none and a negative value if something failed.

This helper function allows drivers to get several clk consumers in one operation with management; the clks will automatically be freed when the device is unbound.

struct clk *devm_clk_get(structdevice * dev, const char * id)

lookup and obtain a managed reference to a clock producer.

Parameters

structdevice*dev
device for clock “consumer”
constchar*id
clock consumer ID

Description

Returns a struct clk corresponding to the clock producer, or valid IS_ERR() condition containing errno. The implementation uses dev and id to determine the clock consumer, and thereby the clock producer. (IOW, id may be identical strings, but clk_get may return different clock producers depending on dev.)

Drivers must assume that the clock source is not enabled.

devm_clk_get should not be called from within interrupt context.

The clock will automatically be freed when the device is unbound from the bus.

struct clk *devm_clk_get_optional(structdevice * dev, const char * id)

lookup and obtain a managed reference to an optional clock producer.

Parameters

structdevice*dev
device for clock “consumer”
constchar*id
clock consumer ID

Description

Behaves the same as devm_clk_get() except where there is no clock producer. In this case, instead of returning -ENOENT, the function returns NULL.

struct clk *devm_get_clk_from_child(structdevice * dev, struct device_node * np, const char * con_id)

lookup and obtain a managed reference to a clock producer from child node.

Parameters

structdevice*dev
device for clock “consumer”
structdevice_node*np
pointer to clock consumer node
constchar*con_id
clock consumer ID

Description

This function parses the clocks, and uses them to look up the struct clk from the registered list of clock providers by using np and con_id.

The clock will automatically be freed when the device is unbound from the bus.

intclk_rate_exclusive_get(struct clk * clk)

get exclusivity over the rate control of a producer

Parameters

structclk*clk
clock source

Description

This function allows drivers to get exclusive control over the rate of a provider. It prevents any other consumer from executing, even indirectly, any operation which could alter the rate of the provider or cause glitches.

If exclusivity is claimed more than once on a clock, even by the same driver, the rate effectively gets locked, as exclusivity can't be preempted.

Must not be called from within atomic context.

Returns success (0) or negative errno.

voidclk_rate_exclusive_put(struct clk * clk)

release exclusivity over the rate control of a producer

Parameters

structclk*clk
clock source

Description

This function allows drivers to release the exclusivity they previously got from clk_rate_exclusive_get().

The caller must balance the number of clk_rate_exclusive_get() and clk_rate_exclusive_put() calls.

Must not be called from within atomic context.

intclk_enable(struct clk * clk)

inform the system when the clock source should be running.

Parameters

structclk*clk
clock source

Description

If the clock can not be enabled/disabled, this should return success.

May be called from atomic contexts.

Returns success (0) or negative errno.

intclk_bulk_enable(int num_clks, const structclk_bulk_data * clks)

inform the system when the set of clks should be running.

Parameters

intnum_clks
the number of clk_bulk_data
conststructclk_bulk_data*clks
the clk_bulk_data table of consumer

Description

May be called from atomic contexts.

Returns success (0) or negative errno.

voidclk_disable(struct clk * clk)

inform the system when the clock source is no longer required.

Parameters

structclk*clk
clock source

Description

Inform the system that a clock source is no longer required by a driver and may be shut down.

May be called from atomic contexts.

Implementation detail: if the clock source is shared between multiple drivers, clk_enable() calls must be balanced by the same number of clk_disable() calls for the clock source to be disabled.

voidclk_bulk_disable(int num_clks, const structclk_bulk_data * clks)

inform the system when the set of clks is no longer required.

Parameters

intnum_clks
the number of clk_bulk_data
conststructclk_bulk_data*clks
the clk_bulk_data table of consumer

Description

Inform the system that a set of clks is no longer required by a driver and may be shut down.

May be called from atomic contexts.

Implementation detail: if the set of clks is shared between multiple drivers, clk_bulk_enable() calls must be balanced by the same number of clk_bulk_disable() calls for the clock source to be disabled.

unsigned longclk_get_rate(struct clk * clk)

obtain the current clock rate (in Hz) for a clock source. This is only valid once the clock source has been enabled.

Parameters

structclk*clk
clock source
voidclk_put(struct clk * clk)

“free” the clock source

Parameters

structclk*clk
clock source

Note

drivers must ensure that all clk_enable calls made on this clock source are balanced by clk_disable calls prior to calling this function.

Description

clk_put should not be called from within interrupt context.

voidclk_bulk_put(int num_clks, structclk_bulk_data * clks)

“free” the clock source

Parameters

intnum_clks
the number of clk_bulk_data
structclk_bulk_data*clks
the clk_bulk_data table of consumer

Note

drivers must ensure that all clk_bulk_enable calls made on this clock source are balanced by clk_bulk_disable calls prior to calling this function.

Description

clk_bulk_put should not be called from within interrupt context.

voidclk_bulk_put_all(int num_clks, structclk_bulk_data * clks)

“free” all the clock sources

Parameters

intnum_clks
the number of clk_bulk_data
structclk_bulk_data*clks
the clk_bulk_data table of consumer

Note

drivers must ensure that all clk_bulk_enable calls made on this clock source are balanced by clk_bulk_disable calls prior to calling this function.

Description

clk_bulk_put_all should not be called from within interrupt context.

voiddevm_clk_put(structdevice * dev, struct clk * clk)

“free” a managed clock source

Parameters

structdevice*dev
device used to acquire the clock
structclk*clk
clock source acquired with devm_clk_get()

Note

drivers must ensure that all clk_enable calls made on this clock source are balanced by clk_disable calls prior to calling this function.

Description

clk_put should not be called from within interrupt context.

longclk_round_rate(struct clk * clk, unsigned long rate)

adjust a rate to the exact rate a clock can provide

Parameters

structclk*clk
clock source
unsignedlongrate
desired clock rate in Hz

Description

This answers the question “if I were to pass rate to clk_set_rate(), what clock rate would I end up with?” without changing the hardware in any way. In other words:

rate = clk_round_rate(clk, r);

and:

clk_set_rate(clk, r);
rate = clk_get_rate(clk);

are equivalent except the former does not modify the clock hardware in any way.

Returns rounded clock rate in Hz, or negative errno.
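As a sketch, a driver that needs roughly 48 MHz might probe the achievable rate before committing; the function name and tolerance window are invented:

#include <linux/clk.h>
#include <linux/errno.h>

static int my_set_48mhz(struct clk *clk)
{
        long rounded = clk_round_rate(clk, 48000000);

        if (rounded < 0)
                return rounded;         /* negative errno from the framework */
        if (rounded < 47000000 || rounded > 49000000)
                return -ERANGE;         /* hardware can't get close enough */

        return clk_set_rate(clk, rounded);      /* rate already known to fit */
}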

intclk_set_rate(struct clk * clk, unsigned long rate)

set the clock rate for a clock source

Parameters

structclk*clk
clock source
unsignedlongrate
desired clock rate in Hz

Description

Updating the rate starts at the top-most affected clock and then walks the tree down to the bottom-most clock that needs updating.

Returns success (0) or negative errno.

intclk_set_rate_exclusive(struct clk * clk, unsigned long rate)

set the clock rate and claim exclusivity over clock source

Parameters

structclk*clk
clock source
unsignedlongrate
desired clock rate in Hz

Description

This helper function allows drivers to atomically set the rate of a producer and claim exclusivity over the rate control of the producer.

It is essentially a combination of clk_set_rate() and clk_rate_exclusive_get(). The caller must balance this call with a call to clk_rate_exclusive_put().

Returns success (0) or negative errno.

boolclk_has_parent(struct clk * clk, struct clk * parent)

check if a clock is a possible parent for another

Parameters

structclk*clk
clock source
structclk*parent
parent clock source

Description

This function can be used in drivers that need to check that a clock can be the parent of another without actually changing the parent.

Returns true if parent is a possible parent for clk, false otherwise.

intclk_set_rate_range(struct clk * clk, unsigned long min, unsigned long max)

set a rate range for a clock source

Parameters

structclk*clk
clock source
unsignedlongmin
desired minimum clock rate in Hz, inclusive
unsignedlongmax
desired maximum clock rate in Hz, inclusive

Description

Returns success (0) or negative errno.

intclk_set_min_rate(struct clk * clk, unsigned long rate)

set a minimum clock rate for a clock source

Parameters

structclk*clk
clock source
unsignedlongrate
desired minimum clock rate in Hz, inclusive

Description

Returns success (0) or negative errno.

intclk_set_max_rate(struct clk * clk, unsigned long rate)

set a maximum clock rate for a clock source

Parameters

structclk*clk
clock source
unsignedlongrate
desired maximum clock rate in Hz, inclusive

Description

Returns success (0) or negative errno.

intclk_set_parent(struct clk * clk, struct clk * parent)

set the parent clock source for this clock

Parameters

structclk*clk
clock source
structclk*parent
parent clock source

Description

Returns success (0) or negative errno.

struct clk *clk_get_parent(struct clk * clk)

get the parent clock source for this clock

Parameters

structclk*clk
clock source

Description

Returns struct clk corresponding to parent clock source, or valid IS_ERR() condition containing errno.

struct clk *clk_get_sys(const char * dev_id, const char * con_id)

get a clock based upon the device name

Parameters

constchar*dev_id
device name
constchar*con_id
connection ID

Description

Returns a struct clk corresponding to the clock producer, or valid IS_ERR() condition containing errno. The implementation uses dev_id and con_id to determine the clock consumer, and thereby the clock producer. In contrast to clk_get() this function takes the device name instead of the device itself for identification.

Drivers must assume that the clock source is not enabled.

clk_get_sys should not be called from within interrupt context.

intclk_save_context(void)

save clock context for poweroff

Parameters

void
no arguments

Description

Saves the context of the clock register for power states in which the contents of the registers will be lost. Occurs deep within the suspend code so locking is not necessary.

voidclk_restore_context(void)

restore clock context after poweroff

Parameters

void
no arguments

Description

This occurs with all clocks enabled. Occurs deep within the resume code so locking is not necessary.

struct clk *clk_get_optional(structdevice * dev, const char * id)

lookup and obtain a reference to an optional clock producer.

Parameters

structdevice*dev
device for clock “consumer”
constchar*id
clock consumer ID

Description

Behaves the same as clk_get() except where there is no clock producer. In this case, instead of returning -ENOENT, the function returns NULL.

Synchronization Primitives

Read-Copy Update (RCU)

RCU_NONIDLE(a)

Indicate idle-loop code that needs RCU readers

Parameters

a
Code that RCU needs to pay attention to.

Description

RCU read-side critical sections are forbidden in the inner idle loop, that is, between the rcu_idle_enter() and the rcu_idle_exit() – RCU will happily ignore any such read-side critical sections. However, things like powertop need tracepoints in the inner idle loop.

This macro provides the way out: RCU_NONIDLE(do_something_with_RCU()) will tell RCU that it needs to pay attention, invoke its argument (in this example, calling the do_something_with_RCU() function), and then tell RCU to go back to ignoring this CPU. It is permissible to nest RCU_NONIDLE() wrappers, but not indefinitely (but the limit is on the order of a million or so, even on 32-bit systems). It is not legal to block within RCU_NONIDLE(), nor is it permissible to transfer control either into or out of RCU_NONIDLE()'s statement.

cond_resched_tasks_rcu_qs()

Report potential quiescent states to RCU

Parameters

Description

This macro resembles cond_resched(), except that it is defined to report potential quiescent states to RCU-tasks even if the cond_resched() machinery were to be shut off, as some advocate for PREEMPTION kernels.

RCU_LOCKDEP_WARN(c,s)

emit lockdep splat if specified condition is met

Parameters

c
condition to check
s
informative message
RCU_INITIALIZER(v)

statically initialize an RCU-protected global variable

Parameters

v
The value to statically initialize with.
rcu_assign_pointer(p,v)

assign to RCU-protected pointer

Parameters

p
pointer to assign to
v
value to assign (publish)

Description

Assigns the specified value to the specified RCU-protected pointer, ensuring that any concurrent RCU readers will see any prior initialization.

Inserts memory barriers on architectures that require them (which is most of them), and also prevents the compiler from reordering the code that initializes the structure after the pointer assignment. More importantly, this call documents which pointers will be dereferenced by RCU read-side code.

In some special cases, you may use RCU_INIT_POINTER() instead of rcu_assign_pointer(). RCU_INIT_POINTER() is a bit faster due to the fact that it does not constrain either the CPU or the compiler. That said, using RCU_INIT_POINTER() when you should have used rcu_assign_pointer() is a very bad thing that results in impossible-to-diagnose memory corruption. So please be careful. See the RCU_INIT_POINTER() comment header for details.

Note that rcu_assign_pointer() evaluates each of its arguments only once, appearances notwithstanding. One of the “extra” evaluations is in typeof() and the other visible only to sparse (__CHECKER__), neither of which actually execute the argument. As with most cpp macros, this execute-arguments-only-once property is important, so please be careful when making changes to rcu_assign_pointer() and the other macros that it invokes.
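A minimal publication sketch, assuming a hypothetical struct foo and a hypothetical global RCU-protected pointer gp; the structure is fully initialized before being made visible to readers:

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
        int a;
        struct rcu_head rcu;            /* used by later reclamation sketches */
};

static struct foo __rcu *gp;            /* global RCU-protected pointer */

static int publish_foo(int a)
{
        struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL);

        if (!p)
                return -ENOMEM;
        p->a = a;                       /* initialize first ... */
        rcu_assign_pointer(gp, p);      /* ... then publish to readers */
        return 0;
}

Note that this sketch leaks any previous value of gp; a real updater would fetch the old pointer (for example with rcu_replace_pointer()) and free it only after a grace period.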

rcu_replace_pointer(rcu_ptr,ptr,c)

replace an RCU pointer, returning its old value

Parameters

rcu_ptr
RCU pointer, whose old value is returned
ptr
regular pointer
c
the lockdep conditions under which the dereference will take place

Description

Perform a replacement, where rcu_ptr is an RCU-annotated pointer and c is the lockdep argument that is passed to the rcu_dereference_protected() call used to read that pointer. The old value of rcu_ptr is returned, and rcu_ptr is set to ptr.

rcu_access_pointer(p)

fetch RCU pointer with no dereferencing

Parameters

p
The pointer to read

Description

Return the value of the specified RCU-protected pointer, but omit the lockdep checks for being in an RCU read-side critical section. This is useful when the value of this pointer is accessed, but the pointer is not dereferenced, for example, when testing an RCU-protected pointer against NULL. Although rcu_access_pointer() may also be used in cases where update-side locks prevent the value of the pointer from changing, you should instead use rcu_dereference_protected() for this use case.

It is also permissible to use rcu_access_pointer() when read-side access to the pointer was removed at least one grace period ago, as is the case in the context of the RCU callback that is freeing up the data, or after a synchronize_rcu() returns. This can be useful when tearing down multi-linked structures after a grace period has elapsed.

rcu_dereference_check(p,c)

rcu_dereference with debug checking

Parameters

p
The pointer to read, prior to dereferencing
c
The conditions under which the dereference will take place

Description

Do an rcu_dereference(), but check that the conditions under which the dereference will take place are correct. Typically the conditions indicate the various locking conditions that should be held at that point. The check should return true if the conditions are satisfied. An implicit check for being in an RCU read-side critical section (rcu_read_lock()) is included.

For example:

bar = rcu_dereference_check(foo->bar, lockdep_is_held(&foo->lock));

could be used to indicate to lockdep that foo->bar may only be dereferenced if either rcu_read_lock() is held, or that the lock required to replace the bar struct at foo->bar is held.

Note that the list of conditions may also include indications of when a lock need not be held, for example during initialisation or destruction of the target struct:

bar = rcu_dereference_check(foo->bar, lockdep_is_held(&foo->lock) ||
                            atomic_read(&foo->usage) == 0);

Inserts memory barriers on architectures that require them (currently only the Alpha), prevents the compiler from refetching (and from merging fetches), and, more importantly, documents exactly which pointers are protected by RCU and checks that the pointer is annotated as __rcu.

rcu_dereference_bh_check(p,c)

rcu_dereference_bh with debug checking

Parameters

p
The pointer to read, prior to dereferencing
c
The conditions under which the dereference will take place

Description

This is the RCU-bh counterpart to rcu_dereference_check().

rcu_dereference_sched_check(p,c)

rcu_dereference_sched with debug checking

Parameters

p
The pointer to read, prior to dereferencing
c
The conditions under which the dereference will take place

Description

This is the RCU-sched counterpart to rcu_dereference_check().

rcu_dereference_protected(p,c)

fetch RCU pointer when updates prevented

Parameters

p
The pointer to read, prior to dereferencing
c
The conditions under which the dereference will take place

Description

Return the value of the specified RCU-protected pointer, but omit the READ_ONCE(). This is useful in cases where update-side locks prevent the value of the pointer from changing. Please note that this primitive does not prevent the compiler from repeating this reference or combining it with other references, so it should not be used without protection of appropriate locks.

This function is only for update-side use. Using this function when protected only by rcu_read_lock() will result in infrequent but very ugly failures.

rcu_dereference(p)

fetch RCU-protected pointer for dereferencing

Parameters

p
The pointer to read, prior to dereferencing

Description

This is a simple wrapper around rcu_dereference_check().

rcu_dereference_bh(p)

fetch an RCU-bh-protected pointer for dereferencing

Parameters

p
The pointer to read, prior to dereferencing

Description

Makes rcu_dereference_check() do the dirty work.

rcu_dereference_sched(p)

fetch RCU-sched-protected pointer for dereferencing

Parameters

p
The pointer to read, prior to dereferencing

Description

Makes rcu_dereference_check() do the dirty work.

rcu_pointer_handoff(p)

Hand off a pointer from RCU to other mechanism

Parameters

p
The pointer to hand off

Description

This is simply an identity function, but it documents where a pointer is handed off from RCU to some other synchronization mechanism, for example, reference counting or locking. In C11, it would map to kill_dependency(). It could be used as follows:

rcu_read_lock();
p = rcu_dereference(gp);
long_lived = is_long_lived(p);
if (long_lived) {
        if (!atomic_inc_not_zero(p->refcnt))
                long_lived = false;
        else
                p = rcu_pointer_handoff(p);
}
rcu_read_unlock();
voidrcu_read_lock(void)

mark the beginning of an RCU read-side critical section

Parameters

void
no arguments

Description

When synchronize_rcu() is invoked on one CPU while other CPUs are within RCU read-side critical sections, then the synchronize_rcu() is guaranteed to block until after all the other CPUs exit their critical sections. Similarly, if call_rcu() is invoked on one CPU while other CPUs are within RCU read-side critical sections, invocation of the corresponding RCU callback is deferred until after all the other CPUs exit their critical sections.

Note, however, that RCU callbacks are permitted to run concurrently with new RCU read-side critical sections. One way that this can happen is via the following sequence of events: (1) CPU 0 enters an RCU read-side critical section, (2) CPU 1 invokes call_rcu() to register an RCU callback, (3) CPU 0 exits the RCU read-side critical section, (4) CPU 2 enters a RCU read-side critical section, (5) the RCU callback is invoked. This is legal, because the RCU read-side critical section that was running concurrently with the call_rcu() (and which therefore might be referencing something that the corresponding RCU callback would free up) has completed before the corresponding RCU callback is invoked.

RCU read-side critical sections may be nested. Any deferred actions will be deferred until the outermost RCU read-side critical section completes.

You can avoid reading and understanding the next paragraph by following this rule: don't put anything in an rcu_read_lock() RCU read-side critical section that would block in a !PREEMPTION kernel. But if you want the full story, read on!

In non-preemptible RCU implementations (pure TREE_RCU and TINY_RCU), it is illegal to block while in an RCU read-side critical section. In preemptible RCU implementations (PREEMPT_RCU) in CONFIG_PREEMPTION kernel builds, RCU read-side critical sections may be preempted, but explicit blocking is illegal. Finally, in preemptible RCU implementations in real-time (with -rt patchset) kernel builds, RCU read-side critical sections may be preempted and they may also block, but only when acquiring spinlocks that are subject to priority inheritance.
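A minimal reader sketch, reusing the hypothetical gp and struct foo from the rcu_assign_pointer() example earlier:

static int read_foo_a(void)
{
        struct foo *p;
        int a = -1;

        rcu_read_lock();                /* begin read-side critical section */
        p = rcu_dereference(gp);        /* fetch the protected pointer */
        if (p)
                a = p->a;               /* p remains valid until unlock */
        rcu_read_unlock();
        return a;                       /* do not use p past this point */
}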

voidrcu_read_unlock(void)

marks the end of an RCU read-side critical section.

Parameters

void
no arguments

Description

In most situations, rcu_read_unlock() is immune from deadlock. However, in kernels built with CONFIG_RCU_BOOST, rcu_read_unlock() is responsible for deboosting, which it does via rt_mutex_unlock(). Unfortunately, this function acquires the scheduler's runqueue and priority-inheritance spinlocks. This means that deadlock could result if the caller of rcu_read_unlock() already holds one of these locks or any lock that is ever acquired while holding them.

That said, RCU readers are never priority boosted unless they were preempted. Therefore, one way to avoid deadlock is to make sure that preemption never happens within any RCU read-side critical section whose outermost rcu_read_unlock() is called with one of rt_mutex_unlock()'s locks held. Such preemption can be avoided in a number of ways, for example, by invoking preempt_disable() before the critical section's outermost rcu_read_lock().

Given that the set of locks acquired by rt_mutex_unlock() might change at any time, a somewhat more future-proofed approach is to make sure that preemption never happens within any RCU read-side critical section whose outermost rcu_read_unlock() is called with irqs disabled. This approach relies on the fact that rt_mutex_unlock() currently only acquires irq-disabled locks.

The second of these two approaches is best in most situations,however, the first approach can also be useful, at least to thosedevelopers willing to keep abreast of the set of locks acquired byrt_mutex_unlock().

See rcu_read_lock() for more information.

voidrcu_read_lock_bh(void)

mark the beginning of an RCU-bh critical section

Parameters

void
no arguments

Description

This is the equivalent of rcu_read_lock(), but also disables softirqs. Note that anything else that disables softirqs can also serve as an RCU read-side critical section.

Note that rcu_read_lock_bh() and the matching rcu_read_unlock_bh() must occur in the same context, for example, it is illegal to invoke rcu_read_unlock_bh() from one task if the matching rcu_read_lock_bh() was invoked from some other task.

voidrcu_read_lock_sched(void)

mark the beginning of a RCU-sched critical section

Parameters

void
no arguments

Description

This is the equivalent of rcu_read_lock(), but disables preemption. Read-side critical sections can also be introduced by anything else that disables preemption, including local_irq_disable() and friends.

Note that rcu_read_lock_sched() and the matching rcu_read_unlock_sched() must occur in the same context, for example, it is illegal to invoke rcu_read_unlock_sched() from process context if the matching rcu_read_lock_sched() was invoked from an NMI handler.

RCU_INIT_POINTER(p,v)

initialize an RCU protected pointer

Parameters

p
The pointer to be initialized.
v
The value to initialize the pointer to.

Description

Initialize an RCU-protected pointer in special cases where readers do not need ordering constraints on the CPU or the compiler. These special cases are:

  1. This use of RCU_INIT_POINTER() is NULLing out the pointer, or
  2. The caller has taken whatever steps are required to prevent RCU readers from concurrently accessing this pointer, or
  3. The referenced data structure has already been exposed to readers either at compile time or via rcu_assign_pointer(), and
    1. You have not made any reader-visible changes to this structure since then, or
    2. It is OK for readers accessing this structure from its new location to see the old state of the structure. (For example, the changes were to statistical counters or to other state where exact synchronization is not required.)

Failure to follow these rules governing use of RCU_INIT_POINTER() will result in impossible-to-diagnose memory corruption. That is, the structures will look OK in crash dumps, but any concurrent RCU readers might see pre-initialized values of the referenced data structure. So please be very careful how you use RCU_INIT_POINTER()!!!

If you are creating an RCU-protected linked structure that is accessed by a single external-to-structure RCU-protected pointer, then you may use RCU_INIT_POINTER() to initialize the internal RCU-protected pointers, but you must use rcu_assign_pointer() to initialize the external-to-structure pointer after you have completely initialized the reader-accessible portions of the linked structure.

Note that unlike rcu_assign_pointer(), RCU_INIT_POINTER() provides no ordering guarantees for either the CPU or the compiler.

RCU_POINTER_INITIALIZER(p,v)

statically initialize an RCU protected pointer

Parameters

p
The pointer to be initialized.
v
The value to initialize the pointer to.

Description

GCC-style initialization for an RCU-protected pointer in a structure field.

kfree_rcu(ptr,rhf)

kfree an object after a grace period.

Parameters

ptr
pointer to kfree
rhf
the name of the struct rcu_head within the type of ptr.

Description

Many RCU callback functions just call kfree() on the base structure. These functions are trivial, but their size adds up, and furthermore when they are used in a kernel module, that module must invoke the high-latency rcu_barrier() function at module-unload time.

The kfree_rcu() function handles this issue. Rather than encoding a function address in the embedded rcu_head structure, kfree_rcu() instead encodes the offset of the rcu_head structure within the base structure. Because the functions are not allowed in the low-order 4096 bytes of kernel virtual memory, offsets up to 4095 bytes can be accommodated. If the offset is larger than 4095 bytes, a compile-time error will be generated in __kvfree_rcu(). If this error is triggered, you can either fall back to use of call_rcu() or rearrange the structure to position the rcu_head structure into the first 4096 bytes.

Note that the allowable offset might decrease in the future, for example,to allow something like kmem_cache_free_rcu().

The BUILD_BUG_ON check must not involve any function calls, hence the checks are done in macros here.
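Continuing the hypothetical struct foo sketch from earlier, the trivial call_rcu() callback disappears entirely; 'rcu' names the rcu_head member within struct foo:

static void retire_foo(struct foo *old)
{
        /* Free after a grace period; no callback function to write. */
        kfree_rcu(old, rcu);
}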

kvfree_rcu()

kvfree an object after a grace period.

Parameters

...
variable arguments

Description

This macro consists of one or two arguments and it is based on whether an object is head-less or not. If it has a head then the semantics stay the same as before:

kvfree_rcu(ptr, rhf);

where ptr is a pointer to kvfree(), and rhf is the name of the rcu_head structure within the type of ptr.

When it comes to the head-less variant, only one argument is passed and that is just a pointer which has to be freed after a grace period. Therefore the semantics are:

kvfree_rcu(ptr);

where ptr is a pointer to kvfree().

Please note, the head-less way of freeing is only permitted from a context that is allowed to sleep, i.e. one that satisfies the might_sleep() annotation. Otherwise, please switch and embed the rcu_head structure within the type of ptr.

voidrcu_head_init(struct rcu_head * rhp)

Initialize rcu_head for rcu_head_after_call_rcu()

Parameters

structrcu_head*rhp
The rcu_head structure to initialize.

Description

If you intend to invoke rcu_head_after_call_rcu() to test whether a given rcu_head structure has already been passed to call_rcu(), then you must also invoke this rcu_head_init() function on it just after allocating that structure. Calls to this function must not race with calls to call_rcu(), rcu_head_after_call_rcu(), or callback invocation.

boolrcu_head_after_call_rcu(struct rcu_head * rhp, rcu_callback_t f)

Has this rcu_head been passed to call_rcu()?

Parameters

structrcu_head*rhp
The rcu_head structure to test.
rcu_callback_tf
The function passed to call_rcu() along with rhp.

Description

Returns true if the rhp has been passed to call_rcu() with f, and false otherwise. Emits a warning in any other case, including the case where rhp has already been invoked after a grace period. Calls to this function must not race with callback invocation. One way to avoid such races is to enclose the call to rcu_head_after_call_rcu() in an RCU read-side critical section that includes a read-side fetch of the pointer to the structure containing rhp.

intrcu_is_cpu_rrupt_from_idle(void)

see if ‘interrupted’ from idle

Parameters

void
no arguments

Description

If the current CPU is idle and running at a first-level (not nested) interrupt, or directly from idle, return true.

The caller must have at least disabled IRQs.

voidrcu_idle_enter(void)

inform RCU that current CPU is entering idle

Parameters

void
no arguments

Description

Enter idle mode, in other words, -leave- the mode in which RCU read-side critical sections can occur. (Though RCU read-side critical sections can occur in irq handlers in idle, a possibility handled by irq_enter() and irq_exit().)

If you add or remove a call to rcu_idle_enter(), be sure to test with CONFIG_RCU_EQS_DEBUG=y.

noinstr voidrcu_user_enter(void)

inform RCU that we are resuming userspace.

Parameters

void
no arguments

Description

Enter RCU idle mode right before resuming userspace. No use of RCU is permitted between this call and rcu_user_exit(). This way the CPU doesn't need to maintain the tick for RCU maintenance purposes when the CPU runs in userspace.

If you add or remove a call to rcu_user_enter(), be sure to test with CONFIG_RCU_EQS_DEBUG=y.

noinstr voidrcu_nmi_exit(void)

inform RCU of exit from NMI context

Parameters

void
no arguments

Description

If we are returning from the outermost NMI handler that interrupted an RCU-idle period, update rdp->dynticks and rdp->dynticks_nmi_nesting to let the RCU grace-period handling know that the CPU is back to being RCU-idle.

If you add or remove a call to rcu_nmi_exit(), be sure to test with CONFIG_RCU_EQS_DEBUG=y.

void noinstrrcu_irq_exit(void)

inform RCU that current CPU is exiting irq towards idle

Parameters

void
no arguments

Description

Exit from an interrupt handler, which might possibly result in entering idle mode, in other words, leaving the mode in which read-side critical sections can occur. The caller must have disabled interrupts.

This code assumes that the idle loop never does anything that might result in unbalanced calls to irq_enter() and irq_exit(). If your architecture's idle loop violates this assumption, RCU will give you what you deserve, good and hard. But very infrequently and irreproducibly.

Use things like work queues to work around this limitation.

You have been warned.

If you add or remove a call to rcu_irq_exit(), be sure to test with CONFIG_RCU_EQS_DEBUG=y.

voidrcu_irq_exit_preempt(void)

Inform RCU that the current CPU is exiting irq towards in-kernel preemption

Parameters

void
no arguments

Description

Same as rcu_irq_exit() but has a sanity check that scheduling is safe from an RCU point of view. Invoked from return from interrupt before kernel preemption.

voidrcu_irq_exit_check_preempt(void)

Validate that scheduling is possible

Parameters

void
no arguments
voidrcu_idle_exit(void)

inform RCU that current CPU is leaving idle

Parameters

void
no arguments

Description

Exit idle mode, in other words, -enter- the mode in which RCU read-side critical sections can occur.

If you add or remove a call to rcu_idle_exit(), be sure to test with CONFIG_RCU_EQS_DEBUG=y.

void noinstrrcu_user_exit(void)

inform RCU that we are exiting userspace.

Parameters

void
no arguments

Description

Exit RCU idle mode while entering the kernel because it can run an RCU read-side critical section at any time.

If you add or remove a call to rcu_user_exit(), be sure to test with CONFIG_RCU_EQS_DEBUG=y.

void__rcu_irq_enter_check_tick(void)

Enable scheduler tick on CPU if RCU needs it.

Parameters

void
no arguments

Description

The scheduler tick is not normally enabled when CPUs enter the kernel from nohz_full userspace execution. After all, nohz_full userspace execution is an RCU quiescent state and the time executing in the kernel is quite short. Except of course when it isn't. And it is not hard to cause a large system to spend tens of seconds or even minutes looping in the kernel, which can cause a number of problems, including RCU CPU stall warnings.

Therefore, if a nohz_full CPU fails to report a quiescent state in a timely manner, the RCU grace-period kthread sets that CPU's ->rcu_urgent_qs flag with the expectation that the next interrupt or exception will invoke this function, which will turn on the scheduler tick, which will enable RCU to detect that CPU's quiescent states, for example, due to cond_resched() calls in CONFIG_PREEMPT=n kernels. The tick will be disabled once a quiescent state is reported for this CPU.

Of course, in carefully tuned systems, there might never be an interrupt or exception. In that case, the RCU grace-period kthread will eventually cause one to happen. However, in less carefully controlled environments, this function allows RCU to get what it needs without creating otherwise useless interruptions.

noinstr voidrcu_nmi_enter(void)

inform RCU of entry to NMI context

Parameters

void
no arguments

Description

If the CPU was idle from RCU's viewpoint, update rdp->dynticks and rdp->dynticks_nmi_nesting to let the RCU grace-period handling know that the CPU is active. This implementation permits nested NMIs, as long as the nesting level does not overflow an int. (You will probably run out of stack space first.)

If you add or remove a call to rcu_nmi_enter(), be sure to test with CONFIG_RCU_EQS_DEBUG=y.

noinstr voidrcu_irq_enter(void)

inform RCU that current CPU is entering irq away from idle

Parameters

void
no arguments

Description

Enter an interrupt handler, which might possibly result in exiting idle mode, in other words, entering the mode in which read-side critical sections can occur. The caller must have disabled interrupts.

Note that the Linux kernel is fully capable of entering an interrupt handler that it never exits, for example when doing upcalls to user mode! This code assumes that the idle loop never does upcalls to user mode. If your architecture's idle loop does do upcalls to user mode (or does anything else that results in unbalanced calls to the irq_enter() and irq_exit() functions), RCU will give you what you deserve, good and hard. But very infrequently and irreproducibly.

Use things like work queues to work around this limitation.

You have been warned.

If you add or remove a call to rcu_irq_enter(), be sure to test with CONFIG_RCU_EQS_DEBUG=y.

boolrcu_is_watching(void)

see if RCU thinks that the current CPU is not idle

Parameters

void
no arguments

Description

Return true if RCU is watching the running CPU, which means that this CPU can safely enter RCU read-side critical sections. In other words, if the current CPU is not in its idle loop or is in an interrupt or NMI handler, return true.

voidcall_rcu(struct rcu_head * head, rcu_callback_t func)

Queue an RCU callback for invocation after a grace period.

Parameters

structrcu_head*head
structure to be used for queueing the RCU updates.
rcu_callback_tfunc
actual callback function to be invoked after the grace period

Description

The callback function will be invoked some time after a full grace period elapses, in other words after all pre-existing RCU read-side critical sections have completed. However, the callback function might well execute concurrently with RCU read-side critical sections that started after call_rcu() was invoked. RCU read-side critical sections are delimited by rcu_read_lock() and rcu_read_unlock(), and may be nested. In addition, regions of code across which interrupts, preemption, or softirqs have been disabled also serve as RCU read-side critical sections. This includes hardware interrupt handlers, softirq handlers, and NMI handlers.

Note that all CPUs must agree that the grace period extended beyond all pre-existing RCU read-side critical sections. On systems with more than one CPU, this means that when “func()” is invoked, each CPU is guaranteed to have executed a full memory barrier since the end of its last RCU read-side critical section whose beginning preceded the call to call_rcu(). It also means that each CPU executing an RCU read-side critical section that continues beyond the start of “func()” must have executed a memory barrier after the call_rcu() but before the beginning of that RCU read-side critical section. Note that these guarantees include CPUs that are offline, idle, or executing in user mode, as well as CPUs that are executing in the kernel.

Furthermore, if CPU A invoked call_rcu() and CPU B invoked the resulting RCU callback function “func()”, then both CPU A and CPU B are guaranteed to execute a full memory barrier during the time interval between the call to call_rcu() and the invocation of “func()” – even if CPU A and CPU B are the same CPU (but again only if the system has more than one CPU).
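A callback sketch for the hypothetical struct foo introduced earlier; container_of() recovers the enclosing object from the rcu_head that was queued:

static void foo_reclaim(struct rcu_head *rhp)
{
        struct foo *p = container_of(rhp, struct foo, rcu);

        kfree(p);                       /* runs after a full grace period */
}

static void foo_retire(struct foo *old)
{
        call_rcu(&old->rcu, foo_reclaim);       /* queue and return at once */
}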

structkvfree_rcu_bulk_data

single block to store kvfree_rcu() pointers

Definition

struct kvfree_rcu_bulk_data {
        unsigned long nr_records;
        struct kvfree_rcu_bulk_data *next;
        void *records[];
};

Members

nr_records
Number of active pointers in the array
next
Next bulk object in the block chain
records
Array of the kvfree_rcu() pointers
structkfree_rcu_cpu_work

single batch of kfree_rcu() requests

Definition

struct kfree_rcu_cpu_work {
        struct rcu_work rcu_work;
        struct rcu_head *head_free;
        struct kvfree_rcu_bulk_data *bkvhead_free[FREE_N_CHANNELS];
        struct kfree_rcu_cpu *krcp;
};

Members

rcu_work
Let queue_rcu_work() invoke workqueue handler after grace period
head_free
List of kfree_rcu() objects waiting for a grace period
bkvhead_free
Bulk-List of kvfree_rcu() objects waiting for a grace period
krcp
Pointer to kfree_rcu_cpu structure
structkfree_rcu_cpu

batch up kfree_rcu() requests for RCU grace period

Definition

struct kfree_rcu_cpu {
        struct rcu_head *head;
        struct kvfree_rcu_bulk_data *bkvhead[FREE_N_CHANNELS];
        struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
        raw_spinlock_t lock;
        struct delayed_work monitor_work;
        bool monitor_todo;
        bool initialized;
        int count;
        struct llist_head bkvcache;
        int nr_bkv_objs;
};

Members

head
List of kfree_rcu() objects not yet waiting for a grace period
bkvhead
Bulk-List of kvfree_rcu() objects not yet waiting for a grace period
krw_arr
Array of batches of kfree_rcu() objects waiting for a grace period
lock
Synchronize access to this structure
monitor_work
Promote head to head_free after KFREE_DRAIN_JIFFIES
monitor_todo
Tracks whether a monitor_work delayed work is pending
initialized
The rcu_work fields have been initialized
count
Number of objects for which GP not started
bkvcache
A simple cache list that contains objects for reuse (access is protected by lock)
nr_bkv_objs
Number of allocated objects at bkvcache

Description

This is a per-CPU structure. The reason that it is not included in the rcu_data structure is to permit this code to be extracted from the RCU files. Such extraction could allow further optimization of the interactions with the slab allocators.

voidsynchronize_rcu(void)

wait until a grace period has elapsed.

Parameters

void
no arguments

Description

Control will return to the caller some time after a full grace period has elapsed, in other words after all currently executing RCU read-side critical sections have completed. Note, however, that upon return from synchronize_rcu(), the caller might well be executing concurrently with new RCU read-side critical sections that began while synchronize_rcu() was waiting. RCU read-side critical sections are delimited by rcu_read_lock() and rcu_read_unlock(), and may be nested. In addition, regions of code across which interrupts, preemption, or softirqs have been disabled also serve as RCU read-side critical sections. This includes hardware interrupt handlers, softirq handlers, and NMI handlers.

Note that this guarantee implies further memory-ordering guarantees. On systems with more than one CPU, when synchronize_rcu() returns, each CPU is guaranteed to have executed a full memory barrier since the end of its last RCU read-side critical section whose beginning preceded the call to synchronize_rcu(). In addition, each CPU having an RCU read-side critical section that extends beyond the return from synchronize_rcu() is guaranteed to have executed a full memory barrier after the beginning of synchronize_rcu() and before the beginning of that RCU read-side critical section. Note that these guarantees include CPUs that are offline, idle, or executing in user mode, as well as CPUs that are executing in the kernel.

Furthermore, if CPU A invokedsynchronize_rcu(), which returnedto its caller on CPU B, then both CPU A and CPU B are guaranteedto have executed a full memory barrier during the execution ofsynchronize_rcu() – even if CPU A and CPU B are the same CPU (butagain only if the system has more than one CPU).
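
Example

The classic use is the publish/retract/free pattern for updaters. The sketch below is illustrative only: gbl_conf, conf_mutex, and struct conf are hypothetical names, not kernel APIs.

	static DEFINE_MUTEX(conf_mutex);	/* serializes updaters */
	struct conf *gbl_conf;			/* readers use rcu_dereference() */

	void replace_conf(struct conf *new_conf)
	{
		struct conf *old;

		mutex_lock(&conf_mutex);
		old = rcu_dereference_protected(gbl_conf,
						lockdep_is_held(&conf_mutex));
		rcu_assign_pointer(gbl_conf, new_conf);
		mutex_unlock(&conf_mutex);

		synchronize_rcu();	/* wait for all pre-existing readers */
		kfree(old);		/* now safe: no reader can still hold old */
	}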

unsigned long get_state_synchronize_rcu(void)

Snapshot current RCU state

Parameters

void
no arguments

Description

Returns a cookie that is used by a later call to cond_synchronize_rcu() to determine whether or not a full grace period has elapsed in the meantime.

void cond_synchronize_rcu(unsigned long oldstate)

Conditionally wait for an RCU grace period

Parameters

unsigned long oldstate
return value from earlier call to get_state_synchronize_rcu()

Description

If a full RCU grace period has elapsed since the earlier call to get_state_synchronize_rcu(), just return. Otherwise, invoke synchronize_rcu() to wait for a full grace period.

Yes, this function does not take counter wrap into account. But counter wrap is harmless. If the counter wraps, we have waited for more than 2 billion grace periods (and way more on a 64-bit system!), so waiting for one additional grace period should be just fine.
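
Example

A minimal sketch of the cookie API; do_other_work() is a hypothetical placeholder for whatever the caller does between the snapshot and the conditional wait.

	unsigned long cookie;

	cookie = get_state_synchronize_rcu();	/* snapshot RCU state */
	do_other_work();			/* hypothetical */
	cond_synchronize_rcu(cookie);		/* returns immediately if a full
						 * grace period already elapsed */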

void rcu_barrier(void)

Wait until all in-flight call_rcu() callbacks complete.

Parameters

void
no arguments

Description

Note that this primitive does not necessarily wait for an RCU grace period to complete. For example, if there are no RCU callbacks queued anywhere in the system, then rcu_barrier() is within its rights to return immediately, without waiting for anything, much less an RCU grace period.

void synchronize_rcu_expedited(void)

Brute-force RCU grace period

Parameters

void
no arguments

Description

Wait for an RCU grace period, but expedite it. The basic idea is to IPI all non-idle non-nohz online CPUs. The IPI handler checks whether the CPU is in an RCU critical section, and if so, it sets a flag that causes the outermost rcu_read_unlock() to report the quiescent state for RCU-preempt or asks the scheduler for help for RCU-sched. On the other hand, if the CPU is not in an RCU read-side critical section, the IPI handler reports the quiescent state immediately.

Although this is a great improvement over previous expedited implementations, it is still unfriendly to real-time workloads, and is thus not recommended for any sort of common-case code. In fact, if you are using synchronize_rcu_expedited() in a loop, please restructure your code to batch your updates, and then use a single synchronize_rcu() instead.
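
Example

A hedged sketch of the batching advice above, as a fragment: obj[], n, i, and unpublish() are hypothetical stand-ins for removing objects from an RCU-protected structure.

	/* Anti-pattern: one expedited grace period per object. */
	for (i = 0; i < n; i++) {
		unpublish(obj[i]);		/* hypothetical */
		synchronize_rcu_expedited();	/* IPIs CPUs on every pass */
		kfree(obj[i]);
	}

	/* Better: batch the updates, then wait once. */
	for (i = 0; i < n; i++)
		unpublish(obj[i]);
	synchronize_rcu();			/* one grace period covers all n */
	for (i = 0; i < n; i++)
		kfree(obj[i]);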

This has the same semantics as (but is more brutal than) synchronize_rcu().

bool rcu_read_lock_held_common(bool * ret)

might we be in RCU-sched read-side critical section?

Parameters

bool *ret
Best guess answer if lockdep cannot be relied on

Description

Returns true if lockdep must be ignored, in which case *ret contains the best guess described below. Otherwise returns false, in which case *ret tells the caller nothing and the caller should instead consult lockdep.

If CONFIG_DEBUG_LOCK_ALLOC is selected, set *ret to nonzero iff in an RCU-sched read-side critical section. In absence of CONFIG_DEBUG_LOCK_ALLOC, this assumes we are in an RCU-sched read-side critical section unless it can prove otherwise. Note that disabling of preemption (including disabling irqs) counts as an RCU-sched read-side critical section. This is useful for debug checks in functions that require that they be called within an RCU-sched read-side critical section.

Check debug_lockdep_rcu_enabled() to prevent false positives during boot and while lockdep is disabled.

Note that if the CPU is in the idle loop from an RCU point of view (that is, in the section between rcu_idle_enter() and rcu_idle_exit()), then rcu_read_lock_held() sets *ret to false even if the CPU did an rcu_read_lock(). The reason for this is that RCU ignores CPUs that are in such a section, considering these as being in an extended quiescent state, so such a CPU is effectively never in an RCU read-side critical section regardless of what RCU primitives it invokes. This state of affairs is required: we need to keep an RCU-free window in idle where the CPU may possibly enter into low power mode. This way, CPUs that have started a grace period can observe the extended quiescent state. Otherwise we would delay any grace period for as long as we run in the idle task.

Similarly, we avoid claiming an RCU read lock held if the current CPU is offline.

void rcu_expedite_gp(void)

Expedite future RCU grace periods

Parameters

void
no arguments

Description

After a call to this function, future calls to synchronize_rcu() and friends act as if the corresponding synchronize_rcu_expedited() function had instead been called.

void rcu_unexpedite_gp(void)

Cancel prior rcu_expedite_gp() invocation

Parameters

void
no arguments

Description

Undo a prior call to rcu_expedite_gp(). If all prior calls to rcu_expedite_gp() are undone by a subsequent call to rcu_unexpedite_gp(), and if the rcu_expedited sysfs/boot parameter is not set, then all subsequent calls to synchronize_rcu() and friends will return to their normal non-expedited behavior.

int rcu_read_lock_held(void)

might we be in RCU read-side critical section?

Parameters

void
no arguments

Description

If CONFIG_DEBUG_LOCK_ALLOC is selected, returns nonzero iff in an RCU read-side critical section. In absence of CONFIG_DEBUG_LOCK_ALLOC, this assumes we are in an RCU read-side critical section unless it can prove otherwise. This is useful for debug checks in functions that require that they be called within an RCU read-side critical section.

Checks debug_lockdep_rcu_enabled() to prevent false positives during boot and while lockdep is disabled.

Note that rcu_read_lock() and the matching rcu_read_unlock() must occur in the same context; for example, it is illegal to invoke rcu_read_unlock() in process context if the matching rcu_read_lock() was invoked from within an irq handler.

Note that rcu_read_lock() is disallowed if the CPU is either idle or offline from an RCU perspective, so check for those as well.

int rcu_read_lock_bh_held(void)

might we be in RCU-bh read-side critical section?

Parameters

void
no arguments

Description

Check for bottom half being disabled, which covers both the CONFIG_PROVE_RCU and not cases. Note that if someone uses rcu_read_lock_bh(), but then later enables BH, lockdep (if enabled) will show the situation. This is useful for debug checks in functions that require that they be called within an RCU read-side critical section.

Check debug_lockdep_rcu_enabled() to prevent false positives during boot.

Note that rcu_read_lock_bh() is disallowed if the CPU is either idle or offline from an RCU perspective, so check for those as well.

void wakeme_after_rcu(struct rcu_head * head)

Callback function to awaken a task after grace period

Parameters

struct rcu_head *head
Pointer to rcu_head member within rcu_synchronize structure

Description

Awaken the corresponding task now that a grace period has elapsed.

void init_rcu_head_on_stack(struct rcu_head * head)

initialize on-stack rcu_head for debugobjects

Parameters

struct rcu_head *head
pointer to rcu_head structure to be initialized

Description

This function informs debugobjects of a new rcu_head structure that has been allocated as an auto variable on the stack. This function is not required for rcu_head structures that are statically defined or that are dynamically allocated on the heap. This function has no effect for !CONFIG_DEBUG_OBJECTS_RCU_HEAD kernel builds.

void destroy_rcu_head_on_stack(struct rcu_head * head)

destroy on-stack rcu_head for debugobjects

Parameters

struct rcu_head *head
pointer to the on-stack rcu_head structure that is about to go out of scope

Description

This function informs debugobjects that an on-stack rcu_head structure is about to go out of scope. As with init_rcu_head_on_stack(), this function is not required for rcu_head structures that are statically defined or that are dynamically allocated on the heap. Also as with init_rcu_head_on_stack(), this function has no effect for !CONFIG_DEBUG_OBJECTS_RCU_HEAD kernel builds.
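
Example

A hedged sketch of the on-stack pattern, mirroring what wakeme_after_rcu() does internally; struct stack_wait and its names are illustrative, not kernel APIs.

	struct stack_wait {
		struct rcu_head rh;
		struct completion done;
	};

	static void stack_wait_cb(struct rcu_head *rhp)
	{
		struct stack_wait *sw = container_of(rhp, struct stack_wait, rh);

		complete(&sw->done);
	}

	void wait_one_gp_on_stack(void)
	{
		struct stack_wait sw;

		init_completion(&sw.done);
		init_rcu_head_on_stack(&sw.rh);
		call_rcu(&sw.rh, stack_wait_cb);
		wait_for_completion(&sw.done);	/* callback ran; rcu_head now idle */
		destroy_rcu_head_on_stack(&sw.rh);
	}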

int srcu_read_lock_held(const struct srcu_struct * ssp)

might we be in SRCU read-side critical section?

Parameters

const struct srcu_struct *ssp
The srcu_struct structure to check

Description

If CONFIG_DEBUG_LOCK_ALLOC is selected, returns nonzero iff in an SRCU read-side critical section. In absence of CONFIG_DEBUG_LOCK_ALLOC, this assumes we are in an SRCU read-side critical section unless it can prove otherwise.

Checks debug_lockdep_rcu_enabled() to prevent false positives during boot and while lockdep is disabled.

Note that SRCU is based on its own state machine and does not rely on normal RCU; it can therefore be called from a CPU which, from an RCU point of view, is in the idle loop or offline.

srcu_dereference_check(p, ssp, c)

fetch SRCU-protected pointer for later dereferencing

Parameters

p
the pointer to fetch and protect for later dereferencing
ssp
pointer to the srcu_struct, which is used to check that we really are in an SRCU read-side critical section.
c
condition to check for update-side use

Description

If PROVE_RCU is enabled, invoking this outside of an RCU read-side critical section will result in an RCU-lockdep splat, unless c evaluates to 1. The c argument will normally be a logical expression containing lockdep_is_held() calls.

srcu_dereference(p, ssp)

fetch SRCU-protected pointer for later dereferencing

Parameters

p
the pointer to fetch and protect for later dereferencing
ssp
pointer to the srcu_struct, which is used to check that we really are in an SRCU read-side critical section.

Description

Make srcu_dereference_check() do the dirty work. If PROVE_RCU is enabled, invoking this outside of an RCU read-side critical section will result in an RCU-lockdep splat.

srcu_dereference_notrace(p, ssp)

no tracing and no lockdep calls from here

Parameters

p
the pointer to fetch and protect for later dereferencing
ssp
pointer to the srcu_struct, which is used to check that we really are in an SRCU read-side critical section.
int srcu_read_lock(struct srcu_struct * ssp)

register a new reader for an SRCU-protected structure.

Parameters

struct srcu_struct *ssp
srcu_struct in which to register the new reader.

Description

Enter an SRCU read-side critical section. Note that SRCU read-side critical sections may be nested. However, it is illegal to call anything that waits on an SRCU grace period for the same srcu_struct, whether directly or indirectly. Please note that one way to indirectly wait on an SRCU grace period is to acquire a mutex that is held elsewhere while calling synchronize_srcu() or synchronize_srcu_expedited().

Note that srcu_read_lock() and the matching srcu_read_unlock() must occur in the same context; for example, it is illegal to invoke srcu_read_unlock() in an irq handler if the matching srcu_read_lock() was invoked in process context.
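
Example

A minimal read-side sketch; my_srcu, shared_ptr, struct foo, and do_something_with() are hypothetical names for a previously initialized srcu_struct and an SRCU-protected pointer.

	int idx;
	struct foo *p;

	idx = srcu_read_lock(&my_srcu);
	p = srcu_dereference(shared_ptr, &my_srcu);
	if (p)
		do_something_with(p);	/* may sleep: SRCU readers can block */
	srcu_read_unlock(&my_srcu, idx);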

void srcu_read_unlock(struct srcu_struct * ssp, int idx)

unregister an old reader from an SRCU-protected structure.

Parameters

struct srcu_struct *ssp
srcu_struct in which to unregister the old reader.
int idx
return value from corresponding srcu_read_lock().

Description

Exit an SRCU read-side critical section.

void smp_mb__after_srcu_read_unlock(void)

ensure full ordering after srcu_read_unlock

Parameters

void
no arguments

Description

Converts the preceding srcu_read_unlock into a two-way memory barrier.

Call this after srcu_read_unlock, to guarantee that all memory operations that occur after smp_mb__after_srcu_read_unlock will appear to happen after the preceding srcu_read_unlock.

int init_srcu_struct(struct srcu_struct * ssp)

initialize a sleep-RCU structure

Parameters

struct srcu_struct *ssp
structure to initialize.

Description

Must invoke this on a given srcu_struct before passing that srcu_struct to any other function. Each srcu_struct represents a separate domain of SRCU protection.

bool srcu_readers_active(struct srcu_struct * ssp)

returns true if there are readers, and false otherwise

Parameters

struct srcu_struct *ssp
which srcu_struct to count active readers (holding srcu_read_lock).

Description

Note that this is not an atomic primitive, and can therefore suffer severe errors when invoked on an active srcu_struct. That said, it can be useful as an error check at cleanup time.

void cleanup_srcu_struct(struct srcu_struct * ssp)

deconstruct a sleep-RCU structure

Parameters

struct srcu_struct *ssp
structure to clean up.

Description

Must invoke this after you are finished using a given srcu_struct that was initialized via init_srcu_struct(), else you leak memory.
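
Example

A hedged sketch of an srcu_struct lifetime in a module; my_srcu and the init/exit function names are illustrative.

	static struct srcu_struct my_srcu;

	static int __init my_mod_init(void)
	{
		return init_srcu_struct(&my_srcu);
	}

	static void __exit my_mod_exit(void)
	{
		srcu_barrier(&my_srcu);		/* flush pending call_srcu() callbacks */
		cleanup_srcu_struct(&my_srcu);
	}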

void call_srcu(struct srcu_struct * ssp, struct rcu_head * rhp, rcu_callback_t func)

Queue a callback for invocation after an SRCU grace period

Parameters

struct srcu_struct *ssp
srcu_struct in which to queue the callback
struct rcu_head *rhp
structure to be used for queueing the SRCU callback.
rcu_callback_t func
function to be invoked after the SRCU grace period

Description

The callback function will be invoked some time after a full SRCU grace period elapses, in other words after all pre-existing SRCU read-side critical sections have completed. However, the callback function might well execute concurrently with other SRCU read-side critical sections that started after call_srcu() was invoked. SRCU read-side critical sections are delimited by srcu_read_lock() and srcu_read_unlock(), and may be nested.

The callback will be invoked from process context, but must nevertheless be fast and must not block.
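
Example

A hedged sketch of deferred freeing via call_srcu(); struct my_obj and my_srcu are illustrative names, with the rcu_head embedded in the object being freed.

	struct my_obj {
		struct rcu_head rh;
		/* payload ... */
	};

	static void my_obj_free_cb(struct rcu_head *rhp)
	{
		kfree(container_of(rhp, struct my_obj, rh));
	}

	/* Updater, after making obj unreachable to new readers: */
	call_srcu(&my_srcu, &obj->rh, my_obj_free_cb);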

void synchronize_srcu_expedited(struct srcu_struct * ssp)

Brute-force SRCU grace period

Parameters

struct srcu_struct *ssp
srcu_struct with which to synchronize.

Description

Wait for an SRCU grace period to elapse, but be more aggressive about spinning rather than blocking when waiting.

Note that synchronize_srcu_expedited() has the same deadlock and memory-ordering properties as does synchronize_srcu().

void synchronize_srcu(struct srcu_struct * ssp)

wait for prior SRCU read-side critical-section completion

Parameters

struct srcu_struct *ssp
srcu_struct with which to synchronize.

Description

Wait for the counts of both indexes to drain to zero. To avoid the possible starvation of synchronize_srcu(), it first waits for the count of index ((->srcu_idx & 1) ^ 1) to drain to zero, and then flips srcu_idx and waits for the count of the other index.

Can block; must be called from process context.

Note that it is illegal to call synchronize_srcu() from the corresponding SRCU read-side critical section; doing so will result in deadlock. However, it is perfectly legal to call synchronize_srcu() on one srcu_struct from some other srcu_struct's read-side critical section, as long as the resulting graph of srcu_structs is acyclic.

There are memory-ordering constraints implied by synchronize_srcu(). On systems with more than one CPU, when synchronize_srcu() returns, each CPU is guaranteed to have executed a full memory barrier since the end of its last corresponding SRCU read-side critical section whose beginning preceded the call to synchronize_srcu(). In addition, each CPU having an SRCU read-side critical section that extends beyond the return from synchronize_srcu() is guaranteed to have executed a full memory barrier after the beginning of synchronize_srcu() and before the beginning of that SRCU read-side critical section. Note that these guarantees include CPUs that are offline, idle, or executing in user mode, as well as CPUs that are executing in the kernel.

Furthermore, if CPU A invoked synchronize_srcu(), which returned to its caller on CPU B, then both CPU A and CPU B are guaranteed to have executed a full memory barrier during the execution of synchronize_srcu(). This guarantee applies even if CPU A and CPU B are the same CPU, but again only if the system has more than one CPU.

Of course, these memory-ordering guarantees apply only when synchronize_srcu(), srcu_read_lock(), and srcu_read_unlock() are passed the same srcu_struct structure.

If SRCU is likely idle, expedite the first request. This semantic was provided by Classic SRCU, and is relied upon by its users, so Tree SRCU must also provide it. Note that detecting idleness is heuristic and subject to both false positives and negatives.

void srcu_barrier(struct srcu_struct * ssp)

Wait until all in-flight call_srcu() callbacks complete.

Parameters

struct srcu_struct *ssp
srcu_struct on which to wait for in-flight callbacks.
unsigned long srcu_batches_completed(struct srcu_struct * ssp)

return batches completed.

Parameters

struct srcu_struct *ssp
srcu_struct on which to report batch completion.

Description

Report the number of batches, correlated with, but not necessarily precisely the same as, the number of grace periods that have elapsed.

void hlist_bl_del_rcu(struct hlist_bl_node * n)

deletes entry from hash list without re-initialization

Parameters

struct hlist_bl_node *n
the element to delete from the hash list.

Note

hlist_bl_unhashed() on entry does not return true after this; the entry is in an undefined state. It is useful for RCU based lockfree traversal.

Description

In particular, it means that we can not poison the forward pointers that may still be used for walking the hash list.

The caller must take whatever precautions are necessary (such as holding appropriate locks) to avoid racing with another list-mutation primitive, such as hlist_bl_add_head_rcu() or hlist_bl_del_rcu(), running on this same list. However, it is perfectly legal to run concurrently with the _rcu list-traversal primitives, such as hlist_bl_for_each_entry().

void hlist_bl_add_head_rcu(struct hlist_bl_node * n, struct hlist_bl_head * h)

Parameters

struct hlist_bl_node *n
the element to add to the hash list.
struct hlist_bl_head *h
the list to add to.

Description

Adds the specified element to the specified hlist_bl, while permitting racing traversals.

The caller must take whatever precautions are necessary (such as holding appropriate locks) to avoid racing with another list-mutation primitive, such as hlist_bl_add_head_rcu() or hlist_bl_del_rcu(), running on this same list. However, it is perfectly legal to run concurrently with the _rcu list-traversal primitives, such as hlist_bl_for_each_entry_rcu(), used to prevent memory-consistency problems on Alpha CPUs. Regardless of the type of CPU, the list-traversal primitive must be guarded by rcu_read_lock().

hlist_bl_for_each_entry_rcu(tpos, pos, head, member)

iterate over rcu list of given type

Parameters

tpos
the type * to use as a loop cursor.
pos
the struct hlist_bl_node to use as a loop cursor.
head
the head for your list.
member
the name of the hlist_bl_node within the struct.
list_tail_rcu(head)

returns the prev pointer of the head of the list

Parameters

head
the head of the list

Note

This should only be used with the list header, and even then only if list_del() and similar primitives are not also used on the list header.

void list_add_rcu(struct list_head * new, struct list_head * head)

add a new entry to rcu-protected list

Parameters

struct list_head *new
new entry to be added
struct list_head *head
list head to add it after

Description

Insert a new entry after the specified head. This is good for implementing stacks.

The caller must take whatever precautions are necessary (such as holding appropriate locks) to avoid racing with another list-mutation primitive, such as list_add_rcu() or list_del_rcu(), running on this same list. However, it is perfectly legal to run concurrently with the _rcu list-traversal primitives, such as list_for_each_entry_rcu().
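
Example

A hedged sketch of publishing to an RCU-protected list; struct item, items, and my_lock are illustrative names (the rcu_head field is used by the list_del_rcu() example below).

	struct item {
		int key;
		struct list_head list;
		struct rcu_head rh;
	};

	static LIST_HEAD(items);
	static DEFINE_SPINLOCK(my_lock);

	void publish_item(struct item *it)
	{
		spin_lock(&my_lock);		/* excludes other updaters */
		list_add_rcu(&it->list, &items);
		spin_unlock(&my_lock);		/* readers may now see it */
	}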

void list_add_tail_rcu(struct list_head * new, struct list_head * head)

add a new entry to rcu-protected list

Parameters

struct list_head *new
new entry to be added
struct list_head *head
list head to add it before

Description

Insert a new entry before the specified head. This is useful for implementing queues.

The caller must take whatever precautions are necessary (such as holding appropriate locks) to avoid racing with another list-mutation primitive, such as list_add_tail_rcu() or list_del_rcu(), running on this same list. However, it is perfectly legal to run concurrently with the _rcu list-traversal primitives, such as list_for_each_entry_rcu().

void list_del_rcu(struct list_head * entry)

deletes entry from list without re-initialization

Parameters

struct list_head *entry
the element to delete from the list.

Note

list_empty() on entry does not return true after this; the entry is in an undefined state. It is useful for RCU based lockfree traversal.

Description

In particular, it means that we can not poison the forward pointers that may still be used for walking the list.

The caller must take whatever precautions are necessary (such as holding appropriate locks) to avoid racing with another list-mutation primitive, such as list_del_rcu() or list_add_rcu(), running on this same list. However, it is perfectly legal to run concurrently with the _rcu list-traversal primitives, such as list_for_each_entry_rcu().

Note that the caller is not permitted to immediately free the newly deleted entry. Instead, either synchronize_rcu() or call_rcu() must be used to defer freeing until an RCU grace period has elapsed.
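
Example

Continuing the publish_item() sketch above (struct item embeds a struct rcu_head named rh): unpublish under the updater lock, then defer the free.

	static void item_free_cb(struct rcu_head *rhp)
	{
		kfree(container_of(rhp, struct item, rh));
	}

	void unpublish_item(struct item *it)
	{
		spin_lock(&my_lock);
		list_del_rcu(&it->list);
		spin_unlock(&my_lock);
		call_rcu(&it->rh, item_free_cb);	/* free after a grace period */
	}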

void hlist_del_init_rcu(struct hlist_node * n)

deletes entry from hash list with re-initialization

Parameters

struct hlist_node *n
the element to delete from the hash list.

Note

list_unhashed() on the node returns true after this. It is useful for RCU based read lockfree traversal if the writer side must know if the list entry is still hashed or already unhashed.

Description

In particular, it means that we can not poison the forward pointers that may still be used for walking the hash list, and we can only zero the pprev pointer so list_unhashed() will return true after this.

The caller must take whatever precautions are necessary (such as holding appropriate locks) to avoid racing with another list-mutation primitive, such as hlist_add_head_rcu() or hlist_del_rcu(), running on this same list. However, it is perfectly legal to run concurrently with the _rcu list-traversal primitives, such as hlist_for_each_entry_rcu().

void list_replace_rcu(struct list_head * old, struct list_head * new)

replace old entry by new one

Parameters

struct list_head *old
the element to be replaced
struct list_head *new
the new element to insert

Description

The old entry will be replaced with the new entry atomically.

Note

old should not be empty.

void __list_splice_init_rcu(struct list_head * list, struct list_head * prev, struct list_head * next, void (*sync)(void))

join an RCU-protected list into an existing list.

Parameters

struct list_head *list
the RCU-protected list to splice
struct list_head *prev
points to the last element of the existing list
struct list_head *next
points to the first element of the existing list
void (*)(void) sync
synchronize_rcu, synchronize_rcu_expedited, …

Description

The list pointed to by prev and next can be RCU-read traversed concurrently with this function.

Note that this function blocks.

Important note: the caller must take whatever action is necessary to prevent any other updates to the existing list. In principle, it is possible to modify the list as soon as sync() begins execution. If this sort of thing becomes necessary, an alternative version based on call_rcu() could be created. But only if -really- needed – there is no shortage of RCU API members.

void list_splice_init_rcu(struct list_head * list, struct list_head * head, void (*sync)(void))

splice an RCU-protected list into an existing list, designed for stacks.

Parameters

struct list_head *list
the RCU-protected list to splice
struct list_head *head
the place in the existing list to splice the first list into
void (*)(void) sync
synchronize_rcu, synchronize_rcu_expedited, …
void list_splice_tail_init_rcu(struct list_head * list, struct list_head * head, void (*sync)(void))

splice an RCU-protected list into an existing list, designed for queues.

Parameters

struct list_head *list
the RCU-protected list to splice
struct list_head *head
the place in the existing list to splice the first list into
void (*)(void) sync
synchronize_rcu, synchronize_rcu_expedited, …
list_entry_rcu(ptr, type, member)

get the struct for this entry

Parameters

ptr
the struct list_head pointer.
type
the type of the struct this is embedded in.
member
the name of the list_head within the struct.

Description

This primitive may safely run concurrently with the _rcu list-mutation primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().

list_first_or_null_rcu(ptr, type, member)

get the first element from a list

Parameters

ptr
the list head to take the element from.
type
the type of the struct this is embedded in.
member
the name of the list_head within the struct.

Description

Note that if the list is empty, it returns NULL.

This primitive may safely run concurrently with the _rcu list-mutation primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().

list_next_or_null_rcu(head, ptr, type, member)

get the next element from a list

Parameters

head
the head for the list.
ptr
the list head to take the next element from.
type
the type of the struct this is embedded in.
member
the name of the list_head within the struct.

Description

Note that if the ptr is at the end of the list, NULL is returned.

This primitive may safely run concurrently with the _rcu list-mutation primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().

list_for_each_entry_rcu(pos, head, member, cond)

iterate over rcu list of given type

Parameters

pos
the type * to use as a loop cursor.
head
the head for your list.
member
the name of the list_head within the struct.
cond
optional lockdep expression if called from non-RCU protection.

Description

This list-traversal primitive may safely run concurrently with the _rcu list-mutation primitives such as list_add_rcu() as long as the traversal is guarded by rcu_read_lock().
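
Example

A reader-side sketch matching the publish_item()/unpublish_item() examples above; key and handle() are hypothetical, and handle() must not block inside the RCU read-side critical section.

	struct item *it;

	rcu_read_lock();
	list_for_each_entry_rcu(it, &items, list) {
		if (it->key == key)
			handle(it);
	}
	rcu_read_unlock();

	/* From an updater already holding my_lock, the optional lockdep
	 * expression documents the protection and silences false positives: */
	list_for_each_entry_rcu(it, &items, list, lockdep_is_held(&my_lock))
		handle(it);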

list_entry_lockless(ptr, type, member)

get the struct for this entry

Parameters

ptr
the struct list_head pointer.
type
the type of the struct this is embedded in.
member
the name of the list_head within the struct.

Description

This primitive may safely run concurrently with the _rcu list-mutation primitives such as list_add_rcu(), but requires some implicit RCU read-side guarding. One example is running within a special exception-time environment where preemption is disabled and where lockdep cannot be invoked. Another example is when items are added to the list, but never deleted.

list_for_each_entry_lockless(pos, head, member)

iterate over rcu list of given type

Parameters

pos
the type * to use as a loop cursor.
head
the head for your list.
member
the name of the list_head within the struct.

Description

This primitive may safely run concurrently with the _rcu list-mutation primitives such as list_add_rcu(), but requires some implicit RCU read-side guarding. One example is running within a special exception-time environment where preemption is disabled and where lockdep cannot be invoked. Another example is when items are added to the list, but never deleted.

list_for_each_entry_continue_rcu(pos, head, member)

continue iteration over list of given type

Parameters

pos
the type * to use as a loop cursor.
head
the head for your list.
member
the name of the list_head within the struct.

Description

Continue to iterate over list of given type, continuing after the current position, which must have been in the list when the RCU read lock was taken. This would typically require either that you obtained the node from a previous walk of the list in the same RCU read-side critical section, or that you held some sort of non-RCU reference (such as a reference count) to keep the node alive and in the list.

This iterator is similar to list_for_each_entry_from_rcu() except this starts after the given position and that one starts at the given position.

list_for_each_entry_from_rcu(pos, head, member)

iterate over a list from current point

Parameters

pos
the type * to use as a loop cursor.
head
the head for your list.
member
the name of the list_node within the struct.

Description

Iterate over the tail of a list starting from a given position, which must have been in the list when the RCU read lock was taken. This would typically require either that you obtained the node from a previous walk of the list in the same RCU read-side critical section, or that you held some sort of non-RCU reference (such as a reference count) to keep the node alive and in the list.

This iterator is similar to list_for_each_entry_continue_rcu() except this starts from the given position and that one starts from the position after the given position.

void hlist_del_rcu(struct hlist_node * n)

deletes entry from hash list without re-initialization

Parameters

struct hlist_node *n
the element to delete from the hash list.

Note

list_unhashed() on entry does not return true after this; the entry is in an undefined state. It is useful for RCU based lockfree traversal.

Description

In particular, it means that we can not poison the forward pointers that may still be used for walking the hash list.

The caller must take whatever precautions are necessary (such as holding appropriate locks) to avoid racing with another list-mutation primitive, such as hlist_add_head_rcu() or hlist_del_rcu(), running on this same list. However, it is perfectly legal to run concurrently with the _rcu list-traversal primitives, such as hlist_for_each_entry().

void hlist_replace_rcu(struct hlist_node * old, struct hlist_node * new)

replace old entry by new one

Parameters

struct hlist_node *old
the element to be replaced
struct hlist_node *new
the new element to insert

Description

The old entry will be replaced with the new entry atomically.

void hlists_swap_heads_rcu(struct hlist_head * left, struct hlist_head * right)

swap the lists the hlist heads point to

Parameters

struct hlist_head *left
The hlist head on the left
struct hlist_head *right
The hlist head on the right

Description

The lists start out as [left ][node1 … ] and [right ][node2 … ].
The lists end up as [left ][node2 … ] and [right ][node1 … ].
void hlist_add_head_rcu(struct hlist_node * n, struct hlist_head * h)

Parameters

struct hlist_node *n
the element to add to the hash list.
struct hlist_head *h
the list to add to.

Description

Adds the specified element to the specified hlist, while permitting racing traversals.

The caller must take whatever precautions are necessary (such as holding appropriate locks) to avoid racing with another list-mutation primitive, such as hlist_add_head_rcu() or hlist_del_rcu(), running on this same list. However, it is perfectly legal to run concurrently with the _rcu list-traversal primitives, such as hlist_for_each_entry_rcu(), used to prevent memory-consistency problems on Alpha CPUs. Regardless of the type of CPU, the list-traversal primitive must be guarded by rcu_read_lock().

void hlist_add_tail_rcu(struct hlist_node * n, struct hlist_head * h)

Parameters

struct hlist_node *n
the element to add to the hash list.
struct hlist_head *h
the list to add to.

Description

Adds the specified element to the specified hlist, while permitting racing traversals.

The caller must take whatever precautions are necessary (such as holding appropriate locks) to avoid racing with another list-mutation primitive, such as hlist_add_head_rcu() or hlist_del_rcu(), running on this same list. However, it is perfectly legal to run concurrently with the _rcu list-traversal primitives, such as hlist_for_each_entry_rcu(), used to prevent memory-consistency problems on Alpha CPUs. Regardless of the type of CPU, the list-traversal primitive must be guarded by rcu_read_lock().

void hlist_add_before_rcu(struct hlist_node * n, struct hlist_node * next)

Parameters

struct hlist_node *n
the new element to add to the hash list.
struct hlist_node *next
the existing element to add the new element before.

Description

Adds the specified element to the specified hlist before the specified node while permitting racing traversals.

The caller must take whatever precautions are necessary (such as holding appropriate locks) to avoid racing with another list-mutation primitive, such as hlist_add_head_rcu() or hlist_del_rcu(), running on this same list. However, it is perfectly legal to run concurrently with the _rcu list-traversal primitives, such as hlist_for_each_entry_rcu(), used to prevent memory-consistency problems on Alpha CPUs.

void hlist_add_behind_rcu(struct hlist_node * n, struct hlist_node * prev)

Parameters

struct hlist_node *n
the new element to add to the hash list.
struct hlist_node *prev
the existing element to add the new element after.

Description

Adds the specified element to the specified hlist after the specified node while permitting racing traversals.

The caller must take whatever precautions are necessary (such as holding appropriate locks) to avoid racing with another list-mutation primitive, such as hlist_add_head_rcu() or hlist_del_rcu(), running on this same list. However, it is perfectly legal to run concurrently with the _rcu list-traversal primitives, such as hlist_for_each_entry_rcu(), used to prevent memory-consistency problems on Alpha CPUs.

hlist_for_each_entry_rcu(pos, head, member, cond)

iterate over rcu list of given type

Parameters

pos
the type * to use as a loop cursor.
head
the head for your list.
member
the name of the hlist_node within the struct.
cond
optional lockdep expression if called from non-RCU protection.

Description

This list-traversal primitive may safely run concurrently with the _rcu list-mutation primitives such as hlist_add_head_rcu() as long as the traversal is guarded by rcu_read_lock().

hlist_for_each_entry_rcu_notrace(pos, head, member)

iterate over rcu list of given type (for tracing)

Parameters

pos
the type * to use as a loop cursor.
head
the head for your list.
member
the name of the hlist_node within the struct.

Description

This list-traversal primitive may safely run concurrently with the _rcu list-mutation primitives such as hlist_add_head_rcu() as long as the traversal is guarded by rcu_read_lock().

This is the same as hlist_for_each_entry_rcu() except that it does not do any RCU debugging or tracing.

hlist_for_each_entry_rcu_bh(pos, head, member)

iterate over rcu list of given type

Parameters

pos
the type * to use as a loop cursor.
head
the head for your list.
member
the name of the hlist_node within the struct.

Description

This list-traversal primitive may safely run concurrently with the _rcu list-mutation primitives such as hlist_add_head_rcu() as long as the traversal is guarded by rcu_read_lock().

hlist_for_each_entry_continue_rcu(pos, member)

iterate over a hlist continuing after current point

Parameters

pos
the type * to use as a loop cursor.
member
the name of the hlist_node within the struct.
hlist_for_each_entry_continue_rcu_bh(pos, member)

iterate over a hlist continuing after current point

Parameters

pos
the type * to use as a loop cursor.
member
the name of the hlist_node within the struct.
hlist_for_each_entry_from_rcu(pos, member)

iterate over a hlist continuing from current point

Parameters

pos
the type * to use as a loop cursor.
member
the name of the hlist_node within the struct.
void hlist_nulls_del_init_rcu(struct hlist_nulls_node * n)

deletes entry from hash list with re-initialization

Parameters

struct hlist_nulls_node *n
the element to delete from the hash list.

Note

hlist_nulls_unhashed() on the node returns true after this. It is useful for RCU based read lockfree traversal if the writer side must know if the list entry is still hashed or already unhashed.

Description

In particular, it means that we can not poison the forward pointers that may still be used for walking the hash list, and we can only zero the pprev pointer so list_unhashed() will return true after this.

The caller must take whatever precautions are necessary (such as holding appropriate locks) to avoid racing with another list-mutation primitive, such as hlist_nulls_add_head_rcu() or hlist_nulls_del_rcu(), running on this same list. However, it is perfectly legal to run concurrently with the _rcu list-traversal primitives, such as hlist_nulls_for_each_entry_rcu().

hlist_nulls_first_rcu(head)

returns the first element of the hash list.

Parameters

head
the head of the list.
hlist_nulls_next_rcu(node)

returns the element of the list after node.

Parameters

node
element of the list.
void hlist_nulls_del_rcu(struct hlist_nulls_node * n)

deletes entry from hash list without re-initialization

Parameters

struct hlist_nulls_node *n
the element to delete from the hash list.

Note

hlist_nulls_unhashed() on entry does not return true after this; the entry is in an undefined state. It is useful for RCU based lockfree traversal.

Description

In particular, it means that we can not poison the forward pointers that may still be used for walking the hash list.

The caller must take whatever precautions are necessary (such as holding appropriate locks) to avoid racing with another list-mutation primitive, such as hlist_nulls_add_head_rcu() or hlist_nulls_del_rcu(), running on this same list. However, it is perfectly legal to run concurrently with the _rcu list-traversal primitives, such as hlist_nulls_for_each_entry().

void hlist_nulls_add_head_rcu(struct hlist_nulls_node * n, struct hlist_nulls_head * h)

Parameters

struct hlist_nulls_node *n
the element to add to the hash list.
struct hlist_nulls_head *h
the list to add to.

Description

Adds the specified element to the specified hlist_nulls, while permitting racing traversals.

The caller must take whatever precautions are necessary (such as holding appropriate locks) to avoid racing with another list-mutation primitive, such as hlist_nulls_add_head_rcu() or hlist_nulls_del_rcu(), running on this same list. However, it is perfectly legal to run concurrently with the _rcu list-traversal primitives, such as hlist_nulls_for_each_entry_rcu(), used to prevent memory-consistency problems on Alpha CPUs. Regardless of the type of CPU, the list-traversal primitive must be guarded by rcu_read_lock().

void hlist_nulls_add_tail_rcu(struct hlist_nulls_node * n, struct hlist_nulls_head * h)

Parameters

struct hlist_nulls_node *n
the element to add to the hash list.
struct hlist_nulls_head *h
the list to add to.

Description

Adds the specified element to the specified hlist_nulls, while permitting racing traversals.

The caller must take whatever precautions are necessary (such as holding appropriate locks) to avoid racing with another list-mutation primitive, such as hlist_nulls_add_head_rcu() or hlist_nulls_del_rcu(), running on this same list. However, it is perfectly legal to run concurrently with the _rcu list-traversal primitives, such as hlist_nulls_for_each_entry_rcu(), used to prevent memory-consistency problems on Alpha CPUs. Regardless of the type of CPU, the list-traversal primitive must be guarded by rcu_read_lock().

hlist_nulls_for_each_entry_rcu(tpos, pos, head, member)

iterate over rcu list of given type

Parameters

tpos
the type * to use as a loop cursor.
pos
the struct hlist_nulls_node to use as a loop cursor.
head
the head of the list.
member
the name of the hlist_nulls_node within the struct.

Description

The barrier() is needed to make sure the compiler doesn't cache the first element [1], as this loop can be restarted [2].

[1] Documentation/core-api/atomic_ops.rst around line 114
[2] Documentation/RCU/rculist_nulls.rst around line 146

hlist_nulls_for_each_entry_safe(tpos, pos, head, member)

iterate over list of given type safe against removal of list entry

Parameters

tpos
the type * to use as a loop cursor.
pos
the struct hlist_nulls_node to use as a loop cursor.
head
the head of the list.
member
the name of the hlist_nulls_node within the struct.
bool rcu_sync_is_idle(struct rcu_sync * rsp)

Are readers permitted to use their fastpaths?

Parameters

struct rcu_sync *rsp
Pointer to rcu_sync structure to use for synchronization

Description

Returns true if readers are permitted to use their fastpaths. Must be invoked within some flavor of RCU read-side critical section.

void rcu_sync_init(struct rcu_sync * rsp)

Initialize an rcu_sync structure

Parameters

struct rcu_sync *rsp
Pointer to rcu_sync structure to be initialized
void rcu_sync_enter_start(struct rcu_sync * rsp)

Force readers onto slow path for multiple updates

Parameters

struct rcu_sync *rsp
Pointer to rcu_sync structure to use for synchronization

Description

Must be called after rcu_sync_init() and before first use.

Ensures rcu_sync_is_idle() returns false and rcu_sync_{enter,exit}() pairs turn into NO-OPs.

void rcu_sync_func(struct rcu_head * rhp)

Callback function managing reader access to fastpath

Parameters

struct rcu_head *rhp
Pointer to rcu_head in rcu_sync structure to use for synchronization

Description

This function is passed to call_rcu() by rcu_sync_enter() and rcu_sync_exit(), so that it is invoked after a grace period following that invocation of enter/exit.

If it is called by rcu_sync_enter() it signals that all the readers were switched onto the slow path.

If it is called by rcu_sync_exit() it takes action based on events that have taken place in the meantime, so that closely spaced rcu_sync_enter() and rcu_sync_exit() pairs need not wait for a grace period.

If another rcu_sync_enter() is invoked before the grace period ended, reset state to allow the next rcu_sync_exit() to let the readers back onto their fastpaths (after a grace period). If both another rcu_sync_enter() and its matching rcu_sync_exit() are invoked before the grace period ended, re-invoke call_rcu() on behalf of that rcu_sync_exit(). Otherwise, set all state back to idle so that readers can again use their fastpaths.

void rcu_sync_enter(struct rcu_sync * rsp)

Force readers onto slowpath

Parameters

struct rcu_sync *rsp
Pointer to rcu_sync structure to use for synchronization

Description

This function is used by updaters who need readers to make use of a slowpath during the update. After this function returns, all subsequent calls to rcu_sync_is_idle() will return false, which tells readers to stay off their fastpaths. A later call to rcu_sync_exit() re-enables reader fastpaths.

When called in isolation, rcu_sync_enter() must wait for a grace period; however, closely spaced calls to rcu_sync_enter() can optimize away the grace-period wait via a state machine implemented by rcu_sync_enter(), rcu_sync_exit(), and rcu_sync_func().

void rcu_sync_exit(struct rcu_sync * rsp)

Allow readers back onto fast path after grace period

Parameters

struct rcu_sync *rsp
Pointer to rcu_sync structure to use for synchronization

Description

This function is used by updaters who have completed, and can therefore now allow readers to make use of their fastpaths after a grace period has elapsed. After this grace period has completed, all subsequent calls to rcu_sync_is_idle() will return true, which tells readers that they can once again use their fastpaths.
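
Example

A hedged sketch of the enter/exit pattern; my_rss and the fastpath/slowpath bodies are illustrative, not a definitive usage.

	static struct rcu_sync my_rss;	/* rcu_sync_init(&my_rss) at setup */

	void reader(void)
	{
		rcu_read_lock();
		if (rcu_sync_is_idle(&my_rss)) {
			/* fastpath: no updater is active */
		} else {
			/* slowpath: coordinate with the updater */
		}
		rcu_read_unlock();
	}

	void updater(void)
	{
		rcu_sync_enter(&my_rss);	/* forces readers onto the slowpath */
		/* ... perform the update ... */
		rcu_sync_exit(&my_rss);		/* fastpaths resume after a grace period */
	}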

void rcu_sync_dtor(struct rcu_sync * rsp)

Clean up an rcu_sync structure

Parameters

struct rcu_sync *rsp
Pointer to rcu_sync structure to be cleaned up