Python Enhancement Proposals

Python »
PEP Index »
PEP 209

PEP 209 – Multi-dimensional Arrays

Author:: Paul Barrett <barrett at stsci.edu>, Travis Oliphant <oliphant at ee.byu.edu>
Status:

Important

This PEP has been withdrawn.

Abstract

This PEP proposes a redesign and re-implementation of themulti-dimensional array module, Numeric, to make it easier to addnew features and functionality to the module. Aspects of Numeric 2that will receive special attention are efficient access to arraysexceeding a gigabyte in size and composed of inhomogeneous datastructures or records. The proposed design uses four Pythonclasses: ArrayType, UFunc, Array, and ArrayView; and a low-levelC-extension module, _ufunc, to handle the array operationsefficiently. In addition, each array type has its own C-extensionmodule which defines the coercion rules, operations, and methodsfor that type. This design enables new types, features, andfunctionality to be added in a modular fashion. The new versionwill introduce some incompatibilities with the current Numeric.

Motivation

Multi-dimensional arrays are commonly used to store and manipulatedata in science, engineering, and computing. Python currently hasan extension module, named Numeric (henceforth called Numeric 1),which provides a satisfactory set of functionality for usersmanipulating homogeneous arrays of data of moderate size (of order10 MB). For access to larger arrays (of order 100 MB or more) ofpossibly inhomogeneous data, the implementation of Numeric 1 isinefficient and cumbersome. In the future, requests by theNumerical Python community for additional functionality is alsolikely as PEPs 211: Adding New Linear Operators to Python, and225: Elementwise/Objectwise Operators illustrate.

Proposal

This proposal recommends a re-design and re-implementation ofNumeric 1, henceforth called Numeric 2, which will enable newtypes, features, and functionality to be added in an easy andmodular manner. The initial design of Numeric 2 should focus onproviding a generic framework for manipulating arrays of varioustypes and should enable a straightforward mechanism for adding newarray types and UFuncs. Functional methods that are more specificto various disciplines can then be layered on top of this core.This new module will still be called Numeric and most of thebehavior found in Numeric 1 will be preserved.

The proposed design uses four Python classes: ArrayType, UFunc,Array, and ArrayView; and a low-level C-extension module to handlethe array operations efficiently. In addition, each array typehas its own C-extension module which defines the coercion rules,operations, and methods for that type. At a later date, when corefunctionality is stable, some Python classes can be converted toC-extension types.

Some planned features are:

Improved memory usage
This feature is particularly important when handling large arraysand can produce significant improvements in performance as well asmemory usage. We have identified several areas where memory usagecan be improved:
1. Use a local coercion model
  Instead of using Python’s global coercion model which createstemporary arrays, Numeric 2, like Numeric 1, will implement alocal coercion model as described inPEP 208 which defers theresponsibility of coercion to the operator. By using internalbuffers, a coercion operation can be done for each array(including output arrays), if necessary, at the time of theoperation. Benchmarks[1] have shown that performance is atmost degraded only slightly and is improved in cases where theinternal buffers are less than the L2 cache size and theprocessor is under load. To avoid array coercion altogether,C functions having arguments of mixed type are allowed inNumeric 2.
2. Avoid creation of temporary arrays
  In complex array expressions (i.e. having more than oneoperation), each operation will create a temporary array whichwill be used and then deleted by the succeeding operation. Abetter approach would be to identify these temporary arraysand reuse their data buffers when possible, namely when thearray shape and type are the same as the temporary array beingcreated. This can be done by checking the temporary array’sreference count. If it is 1, then it will be deleted once theoperation is done and is a candidate for reuse.
3. Optional use of memory-mapped files
  Numeric users sometimes need to access data from very largefiles or to handle data that is greater than the availablememory. Memory-mapped arrays provide a mechanism to do thisby storing the data on disk while making it appear to be inmemory. Memory- mapped arrays should improve access to allfiles by eliminating one of two copy steps during a fileaccess. Numeric should be able to access in-memory andmemory-mapped arrays transparently.
4. Record access
  In some fields of science, data is stored in files as binaryrecords. For example, in astronomy, photon data is stored as a1 dimensional list of photons in order of arrival time. Theserecords or C-like structures contain information about thedetected photon, such as its arrival time, its position on thedetector, and its energy. Each field may be of a differenttype, such as char, int, or float. Such arrays introduce newissues that must be dealt with, in particular byte alignmentor byte swapping may need to be performed for the numericvalues to be properly accessed (though byte swapping is alsoan issue for memory mapped data). Numeric 2 is designed toautomatically handle alignment and representational issueswhen data is accessed or operated on. There are twoapproaches to implementing records; as either a derived arrayclass or a special array type, depending on your point-of-view.We defer this discussion to the Open Issues section.
Additional array types
Numeric 1 has 11 defined types: char, ubyte, sbyte, short, int,long, float, double, cfloat, cdouble, and object. There are noushort, uint, or ulong types, nor are there more complex typessuch as a bit type which is of use to some fields of science andpossibly for implementing masked-arrays. The design of Numeric 1makes the addition of these and other types a difficult anderror-prone process. To enable the easy addition (and deletion)of new array types such as a bit type described below, a re-designof Numeric is necessary.
1. Bit type
  The result of a rich comparison between arrays is an array ofboolean values. The result can be stored in an array of typechar, but this is an unnecessary waste of memory. A betterimplementation would use a bit or boolean type, compressingthe array size by a factor of eight. This is currently beingimplemented for Numeric 1 (by Travis Oliphant) and should beincluded in Numeric 2.
Enhanced array indexing syntax
The extended slicing syntax was added to Python to provide greaterflexibility when manipulating Numeric arrays by allowingstep-sizes greater than 1. This syntax works well as a shorthandfor a list of regularly spaced indices. For those situationswhere a list of irregularly spaced indices are needed, an enhancedarray indexing syntax would allow 1-D arrays to be arguments.
Rich comparisons
The implementation ofPEP 207: Rich Comparisons in Python 2.1provides additional flexibility when manipulating arrays. Weintend to implement this feature in Numeric 2.
Array broadcasting rules
When an operation between a scalar and an array is done, theimplied behavior is to create a new array having the same shape asthe array operand containing the scalar value. This is calledarray broadcasting. It also works with arrays of lesser rank,such as vectors. This implicit behavior is implemented in Numeric1 and will also be implemented in Numeric 2.

Design and Implementation

The design of Numeric 2 has four primary classes:

ArrayType:
This is a simple class that describes the fundamental propertiesof an array-type, e.g. its name, its size in bytes, its coercionrelations with respect to other types, etc., e.g.
```
Int32=ArrayType('Int32',4,'doc-string')
```
Its relation to the other types is defined when the C-extensionmodule for that type is imported. The corresponding Python codeis:
```
Int32.astype[Real64]=Real64
```
This says that the Real64 array-type has higher priority than theInt32 array-type.
The following attributes and methods are proposed for the coreimplementation. Additional attributes can be added on anindividual basis, e.g. .bitsize or .bitstrides for the bit type.
Attributes:
```
.name:e.g."Int32","Float64",etc..typecode:e.g.'i','f',etc.(forbackwardcompatibility).size(inbytes):e.g.4,8,etc..array_rules(mapping):rulesbetweenarraytypes.pyobj_rules(mapping):rulesbetweenarrayandpythontypes.doc:documentationstring
```
Methods:
```
__init__():initialization__del__():destruction__repr__():representation
```
C-API: This still needs to be fleshed-out.
UFunc:
This class is the heart of Numeric 2. Its design is similar tothat of ArrayType in that the UFunc creates a singleton callableobject whose attributes are name, total and input number ofarguments, a document string, and an empty CFunc dictionary; e.g.
```
add=UFunc('add',3,2,'doc-string')
```
When defined the add instance has no C functions associated withit and therefore can do no work. The CFunc dictionary ispopulated or registered later when the C-extension module for anarray-type is imported. The arguments of the register method are:function name, function descriptor, and the CUFunc object. Thecorresponding Python code is
```
add.register('add',(Int32,Int32,Int32),cfunc-add)
```
In the initialization function of an array type module, e.g.Int32, there are two C API functions: one to initialize thecoercion rules and the other to register the CFunc objects.
When an operation is applied to some arrays, the__call__ methodis invoked. It gets the type of each array (if the output arrayis not given, it is created from the coercion rules) and checksthe CFunc dictionary for a key that matches the argument types.If it exists the operation is performed immediately, otherwise thecoercion rules are used to search for a related operation and setof conversion functions. The__call__ method then invokes acompute method written in C to iterate over slices of each array,namely:
```
_ufunc.compute(slice,data,func,swap,conv)
```
The ‘func’ argument is a CFuncObject, while the ‘swap’ and ‘conv’arguments are lists of CFuncObjects for those arrays needing pre- orpost-processing, otherwise None is used. The data argument isa list of buffer objects, and the slice argument gives the numberof iterations for each dimension along with the buffer offset andstep size for each array and each dimension.
We have predefined several UFuncs for use by the__call__ method:cast, swap, getobj, and setobj. The cast and swap functions docoercion and byte-swapping, respectively and the getobj and setobjfunctions do coercion between Numeric arrays and Python sequences.
The following attributes and methods are proposed for the coreimplementation.
Attributes:
```
.name:e.g."add","subtract",etc..nargs:numberoftotalarguments.iargs:numberofinputarguments.cfuncs(mapping):thesetCfunctions.doc:documentationstring
```
Methods:
```
__init__():initialization__del__():destruction__repr__():representation__call__():look-upanddispatchmethodinitrule():initializecoercionruleuninitrule():uninitializecoercionruleregister():registeraCUFuncunregister():unregisteraCUFunc
```
C-API: This still needs to be fleshed-out.

Array:

This class contains information about the array, such as shape,type, endian-ness of the data, etc.. Its operators, ‘+’, ‘-‘,etc. just invoke the corresponding UFunc function, e.g.

def__add__(self,other):returnufunc.add(self,other)

The following attributes, methods, and functions are proposed forthe core implementation.

Attributes:

.shape:shapeofthearray.format:typeofthearray.real(onlycomplex):realpartofacomplexarray.imag(onlycomplex):imaginarypartofacomplexarray

Methods:

__init__():initialization__del__():destruction__repr_():representation__str__():prettyrepresentation__cmp__():richcomparison__len__():__getitem__():__setitem__():__getslice__():__setslice__():numericmethods:copy():copyofarrayaslist():createlistfromarrayasstring():createstringfromarray

Functions:

fromlist():createarrayfromsequencefromstring():createarrayfromstringarray():createarraywithshapeandvalueconcat():concatenatetwoarraysresize():resizearray

C-API: This still needs to be fleshed-out.

ArrayView
This class is similar to the Array class except that the reshapeand flat methods will raise exceptions, since non-contiguousarrays cannot be reshaped or flattened using just pointer andstep-size information.
C-API: This still needs to be fleshed-out.
C-extension modules:
Numeric2 will have several C-extension modules.
1. _ufunc:
  The primary module of this set is the _ufuncmodule.c. Theintention of this module is to do the bare minimum,i.e. iterate over arrays using a specified C function. Theinterface of these functions is the same as Numeric 1, i.e.
```
int(*CFunc)(char*data,int*steps,intrepeat,void*func);
```
  and their functionality is expected to be the same, i.e. theyiterate over the inner-most dimension.
  The following attributes and methods are proposed for the coreimplementation.
  Attributes:
  Methods:
```
compute():
```
  C-API: This still needs to be fleshed-out.
2. _int32, _real64, etc.:
  There will also be C-extension modules for each array type,e.g. _int32module.c, _real64module.c, etc. As mentionedpreviously, when these modules are imported by the UFuncmodule, they will automatically register their functions andcoercion rules. New or improved versions of these modules canbe easily implemented and used without affecting the rest ofNumeric 2.

Open Issues

Does slicing syntax default to copy or view behavior?
The default behavior of Python is to return a copy of a sub-listor tuple when slicing syntax is used, whereas Numeric 1 returns aview into the array. The choice made for Numeric 1 is apparentlyfor reasons of performance: the developers wish to avoid thepenalty of allocating and copying the data buffer during eacharray operation and feel that the need for a deep copy of an arrayto be rare. Yet, some have argued that Numeric’s slice notationshould also have copy behavior to be consistent with Python lists.In this case the performance penalty associated with copy behaviorcan be minimized by implementing copy-on-write. This scheme hasboth arrays sharing one data buffer (as in view behavior) untileither array is assigned new data at which point a copy of thedata buffer is made. View behavior would then be implemented byan ArrayView class, whose behavior be similar to Numeric 1 arrays,i.e. .shape is not settable for non-contiguous arrays. The use ofan ArrayView class also makes explicit what type of data the arraycontains.
Does item syntax default to copy or view behavior?
A similar question arises with the item syntax. For example, ifa=[[0,1,2],[3,4,5]] andb=a[0], then changingb[0] also changesa[0][0], becausea[0] is a reference or view of the first row of a.Therefore, if c is a 2-d array, it would appear thatc[i]should return a 1-d array which is a view into, instead of a copyof, c for consistency. Yet,c[i] can be considered just ashorthand forc[i,:] which would imply copy behavior assumingslicing syntax returns a copy. Should Numeric 2 behave the sameway as lists and return a view or should it return a copy.
How is scalar coercion implemented?
Python has fewer numeric types than Numeric which can causecoercion problems. For example, when multiplying a Python scalarof type float and a Numeric array of type float, the Numeric arrayis converted to a double, since the Python float type is actuallya double. This is often not the desired behavior, since theNumeric array will be doubled in size which is likely to beannoying, particularly for very large arrays. We prefer that thearray type trumps the python type for the same type class, namelyinteger, float, and complex. Therefore, an operation between aPython integer and an Int16 (short) array will return an Int16array. Whereas an operation between a Python float and an Int16array would return a Float64 (double) array. Operations betweentwo arrays use normal coercion rules.
How is integer division handled?
In a future version of Python, the behavior of integer divisionwill change. The operands will be converted to floats, so theresult will be a float. If we implement the proposed scalarcoercion rules where arrays have precedence over Python scalars,then dividing an array by an integer will return an integer arrayand will not be consistent with a future version of Python whichwould return an array of type double. Scientific programmers arefamiliar with the distinction between integer and float-pointdivision, so should Numeric 2 continue with this behavior?
How should records be implemented?
There are two approaches to implementing records depending on yourpoint-of-view. The first is two divide arrays into separateclasses depending on the behavior of their types. For example,numeric arrays are one class, strings a second, and records athird, because the range and type of operations of each classdiffer. As such, a record array is not a new type, but amechanism for a more flexible form of array. To easily access andmanipulate such complex data, the class is comprised of numericarrays having different byte offsets into the data buffer. Forexample, one might have a table consisting of an array of Int16,Real32 values. Two numeric arrays, one with an offset of 0 bytesand a stride of 6 bytes to be interpreted as Int16, and one with anoffset of 2 bytes and a stride of 6 bytes to be interpreted asReal32 would represent the record array. Both numeric arrayswould refer to the same data buffer, but have different offset andstride attributes, and a different numeric type.
The second approach is to consider a record as one of many arraytypes, albeit with fewer, and possibly different, array operationsthan for numeric arrays. This approach considers an array type tobe a mapping of a fixed-length string. The mapping can either besimple, like integer and floating-point numbers, or complex, likea complex number, a byte string, and a C-structure. The recordtype effectively merges the struct and Numeric modules into amulti-dimensional struct array. This approach implies certainchanges to the array interface. For example, the ‘typecode’keyword argument should probably be changed to the moredescriptive ‘format’ keyword.
1. How are record semantics defined and implemented?
  Which ever implementation approach is taken for records, thesyntax and semantics of how they are to be accessed andmanipulated must be decided, if one wishes to have access tosub-fields of records. In this case, the record type canessentially be considered an inhomogeneous list, like a tuplereturned by the unpack method of the struct module; and a 1-darray of records may be interpreted as a 2-d array with thesecond dimension being the index into the list of fields.This enhanced array semantics makes access to an array of oneor more of the fields easy and straightforward. It alsoallows a user to do array operations on a field in a naturaland intuitive way. If we assume that records are implementedas an array type, then last dimension defaults to 0 and cantherefore be neglected for arrays comprised of simple types,like numeric.
How are masked-arrays implemented?
Masked-arrays in Numeric 1 are implemented as a separate arrayclass. With the ability to add new array types to Numeric 2, itis possible that masked-arrays in Numeric 2 could be implementedas a new array type instead of an array class.
How are numerical errors handled (IEEE floating-point errors inparticular)?
It is not clear to the proposers (Paul Barrett and TravisOliphant) what is the best or preferred way of handling errors.Since most of the C functions that do the operation, iterate overthe inner-most (last) dimension of the array. This dimensioncould contain a thousand or more items having one or more errorsof differing type, such as divide-by-zero, underflow, andoverflow. Additionally, keeping track of these errors may come atthe expense of performance. Therefore, we suggest severaloptions:
1. Print a message of the most severe error, leaving it tothe user to locate the errors.
2. Print a message of all errors that occurred and the numberof occurrences, leaving it to the user to locate the errors.
3. Print a message of all errors that occurred and a list ofwhere they occurred.
4. Or use a hybrid approach, printing only the most severeerror, yet keeping track of what and where the errorsoccurred. This would allow the user to locate the errorswhile keeping the error message brief.
What features are needed to ease the integration of FORTRANlibraries and code?

It would be a good idea at this stage to consider how to ease theintegration of FORTRAN libraries and user code in Numeric 2.

Implementation Steps

Implement basic UFunc capability
1. Minimal Array class:
  Necessary class attributes and methods, e.g. .shape, .data,.type, etc.
2. Minimal ArrayType class:
  Int32, Real64, Complex64, Char, Object
3. Minimal UFunc class:
  UFunc instantiation, CFunction registration, UFunc call for1-D arrays including the rules for doing alignment,byte-swapping, and coercion.
4. Minimal C-extension module:
  _UFunc, which does the innermost array loop in C.
  This step implements whatever is needed to do: ‘c = add(a, b)’where a, b, and c are 1-D arrays. It teaches us how to addnew UFuncs, to coerce the arrays, to pass the necessaryinformation to a C iterator method and to do the actuallycomputation.
Continue enhancing the UFunc iterator and Array class
1. Implement some access methods for the Array class:print, repr, getitem, setitem, etc.
2. Implement multidimensional arrays
3. Implement some of basic Array methods using UFuncs:+, -, *, /, etc.
4. Enable UFuncs to use Python sequences.
Complete the standard UFunc and Array class behavior
1. Implement getslice and setslice behavior
2. Work on Array broadcasting rules
3. Implement Record type
Add additional functionality
1. Add more UFuncs
2. Implement buffer or mmap access

Incompatibilities

The following is a list of incompatibilities in behavior betweenNumeric 1 and Numeric 2.

Scalar coercion rules
Numeric 1 has single set of coercion rules for array and Pythonnumeric types. This can cause unexpected and annoying problemsduring the calculation of an array expression. Numeric 2 intendsto overcome these problems by having two sets of coercion rules:one for arrays and Python numeric types, and another just forarrays.
No savespace attribute
The savespace attribute in Numeric 1 makes arrays with thisattribute set take precedence over those that do not have it set.Numeric 2 will not have such an attribute and therefore normalarray coercion rules will be in effect.
Slicing syntax returns a copy
The slicing syntax in Numeric 1 returns a view into the originalarray. The slicing behavior for Numeric 2 will be a copy. Youshould use the ArrayView class to get a view into an array.
Boolean comparisons return a boolean array
A comparison between arrays in Numeric 1 results in a Booleanscalar, because of current limitations in Python. The advent ofRich Comparisons in Python 2.1 will allow an array of Booleans tobe returned.
Type characters are deprecated
Numeric 2 will have an ArrayType class composed of Type instances,for example Int8, Int16, Int32, and Int for signed integers. Thetypecode scheme in Numeric 1 will be available for backwardcompatibility, but will be deprecated.

Appendices

Implicit sub-arrays iteration
A computer animation is composed of a number of 2-D images orframes of identical shape. By stacking these images into a singleblock of memory, a 3-D array is created. Yet the operations to beperformed are not meant for the entire 3-D array, but on the setof 2-D sub-arrays. In most array languages, each frame has to beextracted, operated on, and then reinserted into the output arrayusing a for-like loop. The J language allows the programmer toperform such operations implicitly by having a rank for the frameand array. By default these ranks will be the same during thecreation of the array. It was the intention of the Numeric 1developers to implement this feature, since it is based on thelanguage J. The Numeric 1 code has the required variables forimplementing this behavior, but was never implemented. We intendto implement implicit sub-array iteration in Numeric 2, if thearray broadcasting rules found in Numeric 1 do not fully supportthis behavior.

Copyright

This document is placed in the public domain.

Related PEPs

PEP 207: Rich Comparisonsby Guido van Rossum and David Ascher
PEP 208: Reworking the Coercion Modelby Neil Schemenauer and Marc-Andre’ Lemburg
PEP 211: Adding New Linear Algebra Operators to Pythonby Greg Wilson
PEP 225: Elementwise/Objectwise Operatorsby Huaiyu Zhu
PEP 228: Reworking Python’s Numeric Modelby Moshe Zadka

References

[1]

Greenfield 2000. private communication.

Source:https://github.com/python/peps/blob/main/peps/pep-0209.rst

Last modified:2024-04-14 13:35:25 GMT

Movatterモバイル変換

PEP 209 – Multi-dimensional Arrays