
NumPy C Code Explanations

Fanaticism consists of redoubling your efforts when you have forgotten your aim. —George Santayana

An authority is a person who can tell you more about something than you really care to know. —Unknown

This chapter attempts to explain the logic behind some of the new pieces of code. The purpose of these explanations is to make the ideas behind the implementation easier to understand than by staring at the code alone. Perhaps in this way, the algorithms can be improved upon, borrowed from, and/or optimized.

Memory model

One fundamental aspect of the ndarray is that an array is seen as a “chunk” of memory starting at some location. The interpretation of this memory depends on the stride information. For each dimension in an N-dimensional array, an integer (stride) dictates how many bytes must be skipped to get to the next element in that dimension. Unless you have a single-segment array, this stride information must be consulted when traversing through an array. It is not difficult to write code that accepts strides; you just have to use (char *) pointers because strides are in units of bytes. Keep in mind also that strides do not have to be unit multiples of the element size. Also, remember that if the number of dimensions of the array is 0 (sometimes called a rank-0 array), then the strides and dimensions variables are NULL.

Besides the structural information contained in the strides and dimensions members of the PyArrayObject, the flags contain important information about how the data may be accessed. In particular, the NPY_ARRAY_ALIGNED flag is set when the memory is on a suitable boundary according to the data-type array. Even if you have a contiguous chunk of memory, you cannot just assume it is safe to dereference a data-type-specific pointer to an element. Only if the NPY_ARRAY_ALIGNED flag is set is this a safe operation (on some platforms it will work but on others, like Solaris, it will cause a bus error). The NPY_ARRAY_WRITEABLE flag should also be ensured if you plan on writing to the memory area of the array. It is also possible to obtain a pointer to an unwritable memory area. Sometimes, writing to the memory area when the NPY_ARRAY_WRITEABLE flag is not set will just be rude. Other times it can cause program crashes (e.g. a data-area that is a read-only memory-mapped file).

Data-type encapsulation

The data-type is an important abstraction of the ndarray. Operations will look to the data-type to provide the key functionality that is needed to operate on the array. This functionality is provided in the list of function pointers pointed to by the ‘f’ member of the PyArray_Descr structure. In this way, the number of data-types can be extended simply by providing a PyArray_Descr structure with suitable function pointers in the ‘f’ member. For built-in types there are some optimizations that by-pass this mechanism, but the point of the data-type abstraction is to allow new data-types to be added.

One of the built-in data-types, the void data-type, allows for arbitrary structured types containing 1 or more fields as elements of the array. A field is simply another data-type object along with an offset into the current structured type. In order to support arbitrarily nested fields, several recursive implementations of data-type access are implemented for the void type. A common idiom is to cycle through the elements of the dictionary and perform a specific operation based on the data-type object stored at the given offset. These offsets can be arbitrary numbers. Therefore, the possibility of encountering misaligned data must be recognized and taken into account if necessary.

N-D Iterators

A very common operation in much of NumPy code is the need to iterate over all the elements of a general, strided, N-dimensional array. This operation of a general-purpose N-dimensional loop is abstracted in the notion of an iterator object. To write an N-dimensional loop, you only have to create an iterator object from an ndarray, work with the dataptr member of the iterator object structure, and call the macro PyArray_ITER_NEXT(it) on the iterator object to move to the next element. The “next” element is always in C-contiguous order. The macro works by first special-casing the C-contiguous, 1-D, and 2-D cases, which work very simply.

For the general case, the iteration works by keeping track of a list of coordinate counters in the iterator object. At each iteration, the last coordinate counter is increased (starting from 0). If this counter is smaller than one less than the size of the array in that dimension (a pre-computed and stored value), then the counter is incremented, the dataptr member is advanced by the stride in that dimension, and the macro ends. If the end of a dimension is reached, the counter for the last dimension is reset to zero and the dataptr is moved back to the beginning of that dimension by subtracting the stride value times one less than the number of elements in that dimension (this is also pre-computed and stored in the backstrides member of the iterator object). In this case, the macro does not end; instead, a local dimension counter is decremented so that the next-to-last dimension replaces the role that the last dimension played, and the previously-described tests are executed again on the next-to-last dimension. In this way, the dataptr is adjusted appropriately for arbitrary striding.

The coordinates member of the PyArrayIterObject structure maintains the current N-d counter unless the underlying array is C-contiguous, in which case the coordinate counting is bypassed. The index member of the PyArrayIterObject keeps track of the current flat index of the iterator. It is updated by the PyArray_ITER_NEXT macro.

Broadcasting

In Numeric, broadcasting was implemented in several lines of code buried deep in ufuncobject.c. In NumPy, the notion of broadcasting has been abstracted so that it can be performed in multiple places. Broadcasting is handled by the function PyArray_Broadcast. This function requires a PyArrayMultiIterObject (or something that is a binary equivalent) to be passed in. The PyArrayMultiIterObject keeps track of the broadcast number of dimensions and size in each dimension along with the total size of the broadcast result. It also keeps track of the number of arrays being broadcast and a pointer to an iterator for each of the arrays being broadcast.

The PyArray_Broadcast function takes the iterators that have already been defined and uses them to determine the broadcast shape in each dimension (to create the iterators at the same time that broadcasting occurs, use the PyArray_MultiIterNew function instead). Then, the iterators are adjusted so that each iterator thinks it is iterating over an array with the broadcast size. This is done by adjusting the iterator's number of dimensions and the shape in each dimension. This works because the iterator strides are also adjusted. Broadcasting only adjusts (or adds) length-1 dimensions. For these dimensions, the strides variable is simply set to 0 so that the data-pointer for the iterator over that array doesn’t move as the broadcasting operation operates over the extended dimension.

Broadcasting was always implemented in Numeric using 0-valued strides for the extended dimensions. It is done in exactly the same way in NumPy. The big difference is that now the array of strides is kept track of in a PyArrayIterObject, the iterators involved in a broadcast result are kept track of in a PyArrayMultiIterObject, and the PyArray_Broadcast call implements the broadcasting rules.

Array Scalars

The array scalars offer a hierarchy of Python types that allow a one-to-one correspondence between the data-type stored in an array and the Python type that is returned when an element is extracted from the array. An exception to this rule was made with object arrays. Object arrays are heterogeneous collections of arbitrary Python objects. When you select an item from an object array, you get back the original Python object (and not an object array scalar, which does exist but is rarely used for practical purposes).

The array scalars also offer the same methods and attributes as arrays, with the intent that the same code can be used to support arbitrary dimensions (including 0 dimensions). The array scalars are read-only (immutable), with the exception of the void scalar, which can also be written to so that structured array field setting works more naturally (a[0]['f1'] = value).

Indexing

All Python indexing operations arr[index] are organized by first preparing the index and finding the index type. The supported index types are:

  • integer
  • newaxis
  • slice
  • ellipsis
  • integer arrays/array-likes (fancy)
  • boolean (single boolean array); if there is more than one boolean array as index or the shape does not match exactly, the boolean array will be converted to an integer array instead.
  • 0-d boolean (and also integer); 0-d boolean arrays are a special case which has to be handled in the advanced indexing code. They signal that a 0-d boolean array had to be interpreted as an integer array.

There is also the scalar array special case, which signals that an integer array was interpreted as an integer index. This is important because an integer array index forces a copy but is ignored if a scalar is returned (full integer index). The prepared index is guaranteed to be valid with the exception of out-of-bound values and broadcasting errors for advanced indexing. This includes that an ellipsis is added for incomplete indices, for example when a two-dimensional array is indexed with a single integer.

The next step depends on the type of index which was found. If all dimensions are indexed with an integer, a scalar is returned or set. A single boolean indexing array will call specialized boolean functions. Indices containing an ellipsis or slice but no advanced indexing will always create a view into the old array by calculating the new strides and memory offset. This view can then either be returned or, for assignments, filled using PyArray_CopyObject. Note that PyArray_CopyObject may also be called on temporary arrays in other branches to support complicated assignments when the array is of object dtype.

Advanced indexing

By far the most complex case is advanced indexing, which may or may not be combined with typical view-based indexing. Here integer indices are interpreted as view-based. Before trying to understand this, you may want to make yourself familiar with its subtleties. The advanced indexing code has three different branches and one special case:

  • There is one indexing array and it, as well as the assignment array, can be iterated trivially. For example, they may be contiguous. Also, the indexing array must be of intp type and the value array in assignments should be of the correct type. This is purely a fast path.
  • There are only integer array indices so that no subarray exists.
  • View-based and advanced indexing are mixed. In this case the view-based indexing defines a collection of subarrays that are combined by the advanced indexing. For example, arr[[1, 2, 3], :] is created by vertically stacking the subarrays arr[1, :], arr[2, :], and arr[3, :].
  • There is a subarray but it has exactly one element. This case can be handled as if there is no subarray, but needs some care during setup.

Deciding what case applies, checking broadcasting, and determining the kind of transposition needed are all done in PyArray_MapIterNew. After setting up, there are two cases. If there is no subarray or it only has one element, no subarray iteration is necessary and an iterator is prepared which iterates all indexing arrays as well as the result or value array. If there is a subarray, three iterators are prepared: one for the indexing arrays, one for the result or value array (minus its subarray), and one for the subarrays of the original and the result/assignment array. The first two iterators give (or allow calculation of) the pointers to the start of the subarray, which then allows the subarray iteration to be restarted.

When advanced indices are next to each other, transposing may be necessary. All necessary transposing is handled by PyArray_MapIterSwapAxes and has to be handled by the caller unless PyArray_MapIterNew is asked to allocate the result.

After preparation, getting and setting is relatively straightforward, although the different modes of iteration need to be considered. Unless there is only a single indexing array during item getting, the validity of the indices is checked beforehand. Otherwise, it is handled in the inner loop itself for optimization.

Universal Functions

Universal functions are callable objects that take N inputs and produce M outputs by wrapping basic 1-D loops that work element-by-element into full, easy-to-use functions that seamlessly implement broadcasting, type-checking, buffered coercion, and output-argument handling. New universal functions are normally created in C, although there is a mechanism for creating ufuncs from Python functions (frompyfunc). The user must supply a 1-D loop that implements the basic function, taking the input scalar values and placing the resulting scalars into the appropriate output slots as explained in the implementation.

Setup

Every ufunc calculation involves some overhead related to setting up the calculation. The practical significance of this overhead is that even though the actual calculation of the ufunc is very fast, you will be able to write array- and type-specific code that will work faster for small arrays than the ufunc. In particular, using ufuncs to perform many calculations on 0-D arrays will be slower than other Python-based solutions (the silently-imported scalarmath module exists precisely to give array scalars the look-and-feel of ufunc-based calculations with significantly reduced overhead).

When a ufunc is called, many things must be done. The information collected from these setup operations is stored in a loop object. This loop object is a C structure (that could become a Python object but is not initialized as such because it is only used internally). This loop object has the layout needed to be used with PyArray_Broadcast so that the broadcasting can be handled in the same way as it is handled in other sections of code.

The first thing done is to look up in the thread-specific global dictionary the current values for the buffer size, the error mask, and the associated error object. The state of the error mask controls what happens when an error condition is found. It should be noted that checking of the hardware error flags is only performed after each 1-D loop is executed. This means that if the input and output arrays are contiguous and of the correct type, so that a single 1-D loop is performed, then the flags may not be checked until all elements of the array have been calculated. Looking up these values in a thread-specific dictionary takes time which is easily ignored for all but very small arrays.

After checking the thread-specific global variables, the inputs are evaluated to determine how the ufunc should proceed, and the input and output arrays are constructed if necessary. Any inputs which are not arrays are converted to arrays (using context if necessary). Which of the inputs are scalars (and therefore converted to 0-D arrays) is noted.

Next, an appropriate 1-D loop is selected from the 1-D loops available to the ufunc based on the input array types. This 1-D loop is selected by trying to match the signature of the data-types of the inputs against the available signatures. The signatures corresponding to built-in types are stored in the types member of the ufunc structure. The signatures corresponding to user-defined types are stored in a linked-list of function information with the head element stored as a CObject in the userloops dictionary keyed by the data-type number (the first user-defined type in the argument list is used as the key). The signatures are searched until a signature is found to which the input arrays can all be cast safely (ignoring any scalar arguments which are not allowed to determine the type of the result). The implication of this search procedure is that “lesser types” should be placed below “larger types” when the signatures are stored. If no 1-D loop is found, then an error is reported. Otherwise, the argument_list is updated with the stored signature, in case casting is necessary and to fix the output types assumed by the 1-D loop.

If the ufunc has 2 inputs and 1 output and the second input is an Object array, then a special-case check is performed so that NotImplemented is returned if the second input is not an ndarray, has the __array_priority__ attribute, and has an __r{op}__ special method. In this way, Python is signaled to give the other object a chance to complete the operation instead of using generic object-array calculations. This allows (for example) sparse matrices to override the multiplication operator 1-D loop.

For input arrays that are smaller than the specified buffer size, copies are made of all non-contiguous, misaligned, or out-of-byte-order arrays to ensure that for small arrays, a single loop is used. Then, array iterators are created for all the input arrays and the resulting collection of iterators is broadcast to a single shape.

The output arguments (if any) are then processed and any missing return arrays are constructed. If any provided output array doesn’t have the correct type (or is misaligned) and is smaller than the buffer size, then a new output array is constructed with the special WRITEBACKIFCOPY flag set. At the end of the function, PyArray_ResolveWritebackIfCopy is called so that its contents will be copied back into the output array. Iterators for the output arguments are then processed.

Finally, the decision is made about how to execute the looping mechanism to ensure that all elements of the input arrays are combined to produce the output arrays of the correct type. The options for loop execution are one-loop (for contiguous, aligned, and correct data type), strided-loop (for non-contiguous but still aligned and correct data type), and a buffered loop (for misaligned or incorrect data type situations). Depending on which execution method is called for, the loop is then set up and executed.

Function call

This section describes how the basic universal function computation loop is set up and executed for each of the three different kinds of execution. If NPY_ALLOW_THREADS is defined during compilation, then as long as no object arrays are involved, the Python Global Interpreter Lock (GIL) is released prior to calling the loops. It is re-acquired if necessary to handle error conditions. The hardware error flags are checked only after the 1-D loop is completed.

One Loop

This is the simplest case of all. The ufunc is executed by calling theunderlying 1-D loop exactly once. This is possible only when we havealigned data of the correct type (including byte-order) for both inputand output and all arrays have uniform strides (either contiguous,0-D, or 1-D). In this case, the 1-D computational loop is called onceto compute the calculation for the entire array. Note that thehardware error flags are only checked after the entire calculation iscomplete.

Strided Loop

When the input and output arrays are aligned and of the correct type,but the striding is not uniform (non-contiguous and 2-D or larger),then a second looping structure is employed for the calculation. Thisapproach converts all of the iterators for the input and outputarguments to iterate over all but the largest dimension. The innerloop is then handled by the underlying 1-D computational loop. Theouter loop is a standard iterator loop on the converted iterators. Thehardware error flags are checked after each 1-D loop is completed.

Buffered Loop

This is the code that handles the situation whenever the input and/or output arrays are either misaligned or of the wrong data-type (including being byte-swapped) from what the underlying 1-D loop expects. The arrays are also assumed to be non-contiguous. The code works very much like the strided loop, except that the inner 1-D loop is modified so that pre-processing is performed on the inputs and post-processing is performed on the outputs in bufsize chunks (where bufsize is a user-settable parameter). The underlying 1-D computational loop is called on data that is copied over (if it needs to be). The setup code and the loop code are considerably more complicated in this case because they have to handle:

  • memory allocation of the temporary buffers.
  • deciding whether or not to use buffers on the input and output data (misaligned and/or wrong data-type).
  • copying and possibly casting data for any inputs or outputs for which buffers are necessary.
  • special-casing Object arrays so that reference counts are properly handled when copies and/or casts are necessary.
  • breaking up the inner 1-D loop into bufsize chunks (with a possible remainder).

Again, the hardware error flags are checked at the end of each 1-D loop.

Final output manipulation

Ufuncs allow other array-like classes to be passed seamlessly through the interface, in that inputs of a particular class will induce the outputs to be of that same class. The mechanism by which this works is the following. If any of the inputs are not ndarrays and define the __array_wrap__ method, then the class with the largest __array_priority__ attribute determines the type of all the outputs (with the exception of any output arrays passed in). The __array_wrap__ method of the input array will be called with the ndarray being returned from the ufunc as its input. There are two calling styles of the __array_wrap__ function supported. The first takes the ndarray as the first argument and a tuple of “context” as the second argument. The context is (ufunc, arguments, output argument number). This is the first call tried. If a TypeError occurs, then the function is called with just the ndarray as the first argument.

Methods

There are three methods of ufuncs that require calculation similar to the general-purpose ufuncs. These are reduce, accumulate, and reduceat. Each of these methods requires a setup command followed by a loop. There are four loop styles possible for the methods, corresponding to no-elements, one-element, strided-loop, and buffered-loop. These are the same basic loop styles as implemented for the general-purpose function call except for the no-element and one-element cases, which are special cases occurring when the input array objects have 0 and 1 elements, respectively.

Setup

The setup function for all three methods is construct_reduce. This function creates a reducing loop object and fills it with the parameters needed to complete the loop. All of the methods only work on ufuncs that take 2 inputs and return 1 output. Therefore, the underlying 1-D loop is selected assuming a signature of [otype, otype, otype] where otype is the requested reduction data-type. The buffer size and error handling are then retrieved from (per-thread) global storage. For small arrays that are misaligned or have incorrect data-type, a copy is made so that the un-buffered section of code is used. Then, the looping strategy is selected. If there is 1 element or 0 elements in the array, then a simple looping method is selected. If the array is not misaligned and has the correct data-type, then strided looping is selected. Otherwise, buffered looping must be performed. Looping parameters are then established, and the return array is constructed. The output array is of a different shape depending on whether the method is reduce, accumulate, or reduceat. If an output array is already provided, then its shape is checked. If the output array is not C-contiguous, aligned, and of the correct data type, then a temporary copy is made with the WRITEBACKIFCOPY flag set. In this way, the methods will be able to work with a well-behaved output array, but the result will be copied back into the true output array when PyArray_ResolveWritebackIfCopy is called at function completion. Finally, iterators are set up to loop over the correct axis (depending on the value of axis provided to the method) and the setup routine returns to the actual computation routine.

Reduce

All of the ufunc methods use the same underlying 1-D computational loops with input and output arguments adjusted so that the appropriate reduction takes place. For example, the key to the functioning of reduce is that the 1-D loop is called with the output and the second input pointing to the same position in memory and both having a step-size of 0. The first input is pointing to the input array with a step-size given by the appropriate stride for the selected axis. In this way, the operation performed is

\begin{align*}
o & = & i[0] \\
o & = & i[k]\ \textrm{<op>}\ o \quad k = 1 \ldots N
\end{align*}

where N+1 is the number of elements in the input i, o is the output, and i[k] is the k^{\textrm{th}} element of i along the selected axis. This basic operation is repeated for arrays with greater than 1 dimension so that the reduction takes place for every 1-D sub-array along the selected axis. An iterator with the selected dimension removed handles this looping.

For buffered loops, care must be taken to copy and cast data before the loop function is called because the underlying loop expects aligned data of the correct data-type (including byte-order). The buffered loop must handle this copying and casting prior to calling the loop function on chunks no greater than the user-specified bufsize.

Accumulate

The accumulate function is very similar to the reduce function in that the output and the second input both point to the output. The difference is that the second input points to memory one stride behind the current output pointer. Thus, the operation performed is

\begin{align*}
o[0] & = & i[0] \\
o[k] & = & i[k]\ \textrm{<op>}\ o[k-1] \quad k = 1 \ldots N.
\end{align*}

The output has the same shape as the input, and each 1-D loop operates over N elements when the shape in the selected axis is N+1. Again, buffered loops take care to copy and cast the data before calling the underlying 1-D computational loop.

Reduceat

The reduceat function is a generalization of both the reduce and accumulate functions. It implements a reduce over ranges of the input array specified by indices. The extra indices argument is checked to be sure that every index is not too large for the input array along the selected dimension before the loop calculations take place. The loop implementation is handled using code that is very similar to the reduce code, repeated as many times as there are elements in the indices input. In particular: the first input pointer passed to the underlying 1-D computational loop points to the input array at the correct location indicated by the index array. In addition, the output pointer and the second input pointer passed to the underlying 1-D loop point to the same position in memory. The size of the 1-D computational loop is fixed to be the difference between the current index and the next index (when the current index is the last index, then the next index is assumed to be the length of the array along the selected dimension). In this way, the 1-D loop will implement a reduce over the specified indices.

Misaligned data, or a loop data-type that does not match the input and/or output data-type, is handled using buffered code wherein data is copied to a temporary buffer and cast to the correct data-type if necessary prior to calling the underlying 1-D function. The temporary buffers are created in (element) sizes no bigger than the user-settable buffer-size value. Thus, the loop must be flexible enough to call the underlying 1-D computational loop enough times to complete the total calculation in chunks no bigger than the buffer-size.

© Copyright 2008-2018, The SciPy community. Last updated on Jul 24, 2018. Created using Sphinx 1.6.6.
