Need for dealing with strides when writing performant code against multiple array libraries #782

rgommers started this conversation

Early on when putting the array API standard together, we made a decision that strides must not be exposed in the Python API, because multiple array libraries don't support them, and memory layout is more an implementation detail than an API that should be exposed to Python users. We made that decision so early that I believe there isn't even a dedicated issue for it; the closest one is gh-571 about C vs. Fortran memory order. It's also related to gh-24 (strides are mentioned explicitly there).

The strides topic came up a couple of times recently though, and we should think about this more:

  1. In the SciPy proceedings paper, one of the main review comments was about this: Paper: Python Array API Standard: Toward Array Interoperability in the Scientific Python Ecosystem (scipy-conference/scipy_proceedings#822)
  2. In the SciPy RFC for array API standard adoption, it quickly came up because one of the first functions for which a conversion was tried used as_strided: RFC: SciPy array types & libraries support (scipy/scipy#18286 (comment))

Previous comments

I'll copy some of the key comments below to have them in a single place:

From @jpivarski: The thing I was most surprised about in the read-through was that "strides" are not considered part of the array object. I recognize that some libraries (e.g. JAX) are able to do great things by not allowing strides to be free. But then, you show in the SciPy benchmark (Fig. 2c, 2d) that not having this handle implies a severe performance penalty, which is likely to be a major take-away from this paper by most readers. In the discussion, you say that it illustrates that users will need to apply library-specific optimizations after all. (Implementing a backend or runtime switching system is non-goal number 2.) It seems to me like strides-awareness would be a good candidate for one of the optional extensions, if extensions are not strictly sets of functions. Perhaps that's a question or suggestion I should take up with the Consortium's normal channels, not a review of the paper, but it's a question the paper leads me to have.

From @leofang: Regarding strides: I have conflicting views and will continue to self-debate. On the one hand, when over half of the array libraries (Jax, TF, Dask, cuNumeric) do not offer strides for various reasons (ex: it makes no sense for distributed arrays), it's my opinion that even making it optional is not the right way to go, just like many other design considerations that we've made. Also, for those supporting the stride semantics, their strides are accessible via DLPack, it's just that currently it's hard to access them from within pure Python. On the other hand, as a low-level C++ library developer (in short, we write backends for the array libraries), I do see the pain point that it's hard for us to even imagine how our library would be adopted and consumed by the array libraries that do not have strides. It could be possible that our library simply cannot serve as the battery behind them due to different design philosophies, idk. Though, to be fair, the standard aims at the array libraries, not their backends, so if there's a standardization effort targeting array backends, it might be a better place for standardizing strides.

From @tylerjereddy: scipy.signal.welch seemed to be a well-behaved target for prototyping based on that, so I tried doing that from scratch on a feature branch locally, progressively allowing more welch() tests to accept CuPy arrays when an env variable is set.
Even for this "well-behaved"/reasonable target, I ran into problems pretty early on. For example, both in the original feature branch and in my branch, there doesn't seem to be an elegant solution made for handling numpy.lib.stride_tricks.as_strided. The docs for that function don't even recommend using it, and yet CuPy (and apparently torch from the Quansight example) do provide it, outside of the array API standard proper.
So, I guess my first real-world experience makes me wonder what our policy on special casing in these scenarios will be--ideally, I'd like to just remove the usage of as_strided() and substitute with some other approach that doesn't require conditionals on the exact array type/namespace. While this is a rather specific blocker, if I encounter something like this even for a "well behaved" case, I can imagine quite a few headaches for the cases that are not well behaved.

From @rgommers: Let's have a look at this case - this is the code in question:

```python
# Created strided array of data segments
if nperseg == 1 and noverlap == 0:
    result = x[..., np.newaxis]
else:
    # https://stackoverflow.com/a/5568169
    step = nperseg - noverlap
    shape = x.shape[:-1] + ((x.shape[-1] - noverlap) // step, nperseg)
    strides = x.strides[:-1] + (step * x.strides[-1], x.strides[-1])
    result = np.lib.stride_tricks.as_strided(x, shape=shape, strides=strides)
```

The as_strided usage is there to save memory. It's actually pretty difficult to understand what the code does exactly, though, so it's good for performance but not great for maintainability. <... example with more details cut, see here> At that point, we can say we're happy with more readability at the cost of some performance (TBD if the for-loop matters; my guess is it will). Or we just keep the special-casing for numpy using as_strided. Then we do have two code paths, but no performance loss and also easier-to-understand code.
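To make the trade-off concrete, here is a sketch of what a portable, copy-based segmentation could look like. This is not SciPy's actual code; `segment_copy` is a hypothetical name, and it assumes `0 <= noverlap < nperseg`:

```python
import numpy as np

def segment_copy(x, nperseg, noverlap):
    """Portable equivalent of the as_strided segmentation above.

    Builds the (n_segments, nperseg) window matrix with plain slicing
    and a stack, so it copies data instead of creating a view.
    """
    step = nperseg - noverlap
    n_segments = (x.shape[-1] - noverlap) // step
    return np.stack(
        [x[..., i * step:i * step + nperseg] for i in range(n_segments)],
        axis=-2,
    )
```

A loop like this is exactly the kind of pattern a JIT/AOT compiler could in principle fuse away; without one, it pays for `n_segments` slice copies.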

To discuss

It's clear that strides cannot be universally supported, and equally clear that users who write algorithmic code may need to access strides in some cases to avoid significant performance loss.

Essentially, I think dealing with strides by hand in libraries that allow it is doing work that, in a more full-featured array library, will or should be done by a compiler (e.g. the JIT/AOT compilers in JAX or PyTorch).

We need to at least suggest and document some strategy for cases like the as_strided usage above. In the similar but more specific case of order=, which mattered quite a bit performance-wise for scikit-learn, they ended up writing a library-specific utility function that uses the order keyword when present and avoids it otherwise: https://github.com/scikit-learn/scikit-learn/blob/21312644df0a6b4c6f3c27a74ac9d26cf49c2304/sklearn/utils/_array_api.py#L360
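A minimal sketch of that pattern (the function name and the set of namespaces that accept order= are illustrative assumptions, not scikit-learn's actual code):

```python
import numpy as np

def asarray_with_order(x, xp, order=None):
    # Hypothetical helper: forward order= only to namespaces known to
    # accept it (NumPy-like libraries); silently drop it otherwise,
    # since order= is not part of the array API standard.
    if order is not None and xp.__name__ in {"numpy", "cupy"}:
        return xp.asarray(x, order=order)
    return xp.asarray(x)
```

Called as `asarray_with_order(data, xp, order="F")`, this keeps the fast Fortran-ordered path on NumPy while staying valid for libraries whose asarray has no order keyword.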


Replies: 8 comments


rgommers
Jun 15, 2023
Maintainer Author

Having a look at the usage frequency of as_strided and .strides in Python code:

Scikit-learn:

```
$ rg strides --type py
sklearn/datasets/tests/test_samples_generator.py
135:            signs = signs.view(dtype="|S{0}".format(signs.strides[0]))
sklearn/feature_extraction/image.py
284:    """Extracts patches of any n-dimensional array in place using strides.
326:    patch_strides = arr.strides
329:    indexing_strides = arr[slices].strides
336:    strides = tuple(list(indexing_strides) + list(patch_strides))
338:    patches = as_strided(arr, shape=shape, strides=strides)

$ rg as_strided --type py
sklearn/feature_extraction/image.py
16:from numpy.lib.stride_tricks import as_strided
338:    patches = as_strided(arr, shape=shape, strides=strides)
```

SciPy:

```
$ rg as_strided --type py
signal/_spectral_py.py
1958:        result = np.lib.stride_tricks.as_strided(x, shape=shape,
linalg/_special_matrices.py
5:from numpy.lib.stride_tricks import as_strided
218:    return as_strided(vals[len(c)-1:], shape=out_shp, strides=(-n,n)).copy()
259:    return as_strided(c_ext[L-1:], shape=(L, L), strides=(-n,n)).copy()
316:    return as_strided(vals, shape=out_shp, strides=(n,n)).copy()
signal/_signaltools.py
2124:                zi = np.lib.stride_tricks.as_strided(zi, expected_shape,
signal/tests/test_signaltools.py
445:        a = np.lib.stride_tricks.as_strided(a, shape=(n, 1000), strides=(8008, 8))
linalg/special_matrices.py
12:'fiedler','fiedler_companion','convolution_matrix','as_strided'
interpolate/tests/test_fitpack.py
347:        from numpy.lib.stride_tricks import as_strided
354:        x = as_strided(np.zeros(()), shape=(size,))
```

There are some more usages of .strides in SciPy, but not that many either. Cleaned up by hand, because the code searches returned comments and .py files that are only used to generate Cython code:

```
$ rg strides --type py
signal/_signaltools.py
2110:                strides = zi.ndim * [None]
2115:                        strides[k] = zi.strides[k]
2117:                        strides[k] = zi.strides[k]
2119:                        strides[k] = 0
2125:                                                     strides)
linalg/_special_matrices.py
217:    n = vals.strides[0]
218:    return as_strided(vals[len(c)-1:], shape=out_shp, strides=(-n, n)).copy()
258:    n = c_ext.strides[0]
259:    return as_strided(c_ext[L-1:], shape=(L, L), strides=(-n, n)).copy()
315:    n = vals.strides[0]
316:    return as_strided(vals, shape=out_shp, strides=(n, n)).copy()
signal/_spectral_py.py
779:                # Sxx has one additional dimension for time strides
1957:        strides = x.strides[:-1]+(step*x.strides[-1], x.strides[-1])
1959:                                                 strides=strides)
signal/tests/test_signaltools.py
445:        a = np.lib.stride_tricks.as_strided(a, shape=(n, 1000), strides=(8008, 8))
1123:        # numpy arrays that have odd strides. The stride value below gets
1127:        a.strides = 16
interpolate/_ndbspline.py
189:        _strides_c1 = np.asarray([s // c1.dtype.itemsize
190:                                  for s in c1.strides], dtype=np.intp)
202:                                 _strides_c1,
spatial/tests/test_kdtree.py
1296:    assert query_discontiguous.strides[-1] != query_contiguous.strides[-1]
1297:    assert d_discontiguous.strides[-1] != d_contiguous.strides[-1]
spatial/transform/tests/test_rotation.py
1473:    assert all(i >= 0 for i in e1.strides)
1474:    assert all(i >= 0 for i in e2.strides)
```

rgommers
Jun 15, 2023
Maintainer Author

Interestingly enough, the second of the as_strided feature requests to JAX was left open and they're willing to implement it: jax#11354. Quoting: "But instead of closing the request like jax#3171, I think we should consider adding an as_strided API, with clear warnings about the lack of operational semantics guarantees. It may be a lower priority item though..."

It may be insightful to take the scipy.signal.welch for-loop implementation above and see if JAX can actually optimize it to give performance similar to (or even better than) the NumPy as_strided equivalent.


Strides also come up a lot in broadcast implementations:

```python
>>> a = np.arange(2*3*5).reshape(2, 3, 5)
>>> b = np.array(100)
>>> aprime, bprime = np.broadcast_arrays(a, b)
>>> bprime.shape
(2, 3, 5)
>>> bprime.strides
(0, 0, 0)
```

I'm not sure if this is relevant (because it's an implementation detail), but in case it is, this is another big way that non-trivial strides enter into array libraries without JIT compilation/operator fusing.
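On NumPy specifically, those zero strides are exactly what makes broadcasting copy-free; a minimal check of that mechanism:

```python
import numpy as np

# broadcast_to materializes no data: broadcast dimensions get stride 0,
# so every element of b aliases the single scalar in memory.
b = np.broadcast_to(np.array(100), (2, 3, 5))
assert b.strides == (0, 0, 0)
assert b.base is not None  # a (readonly) view, not a copy
```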


leofang
Jun 17, 2023
Collaborator

> Strides also come up a lot in broadcast implementations:
> ...
> I'm not sure if this is relevant (because it's an implementation), ...

I don't think so; broadcasting is a separate topic from strides. Only the array shape is relevant for determining broadcasting. Strided array libraries do have defined behavior for the strides of broadcast arrays, as you pointed out. But all array libraries, strided or not, can support broadcasting, and thus it's already included in the standard.

FYI this issue has been discussed in the array API meeting this week 🙂 Someone will write up a summary later, so I wouldn't dare attempt to give a poor summary in my own words.

But I did just come up with another reason against standardizing strides. Strides are meaningful only for dense arrays/tensors. Suppose I want to write a multidimensional array library that is array-API compliant and has dual dense/sparse support under the hood, without users having to decide whether an array should be stored in the dense format or any of the known sparse formats (COO, CSR, ...; it gets complicated for N-D tensors). Then clearly I simply cannot return strides, because they are an implementation detail of my library that most users should not need to know or rely on.


rgommers
Jun 17, 2023
Maintainer Author

I agree with what @leofang says. Everything up to (2, 3, 5) will work for any conforming library, and is unrelated to striding. Only this part is specific:

```python
>>> bprime.strides
(0, 0, 0)
```

and it's very rarely needed to explicitly access or manipulate strides.


I suspect all of those uses of strides can be written using sliding_window_view; some as broadcast_to. Which probably should be more visible anyway.

In either case, the point is that sliding_window_view promises a view (no copy), but otherwise doesn't refer to strides and doesn't need to. Maybe the no-copy promise doesn't need to exist (unspecified); otherwise the only thought I would have is an apply_... function, and I am not exactly sure how much it would help.
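For the welch-style segmentation discussed above, sliding_window_view can indeed express the same windows without any manual stride arithmetic. A sketch (`segments_via_swv` is a hypothetical name, assuming `0 <= noverlap < nperseg`):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def segments_via_swv(x, nperseg, noverlap):
    # Take every length-nperseg window along the last axis, then keep
    # every step-th window; still a view, but the stride tricks stay
    # inside NumPy rather than in user code.
    step = nperseg - noverlap
    return sliding_window_view(x, nperseg, axis=-1)[..., ::step, :]
```

The result matches the as_strided construction in the welch snippet, since `ceil((n - nperseg + 1) / step)` equals `(n - noverlap) // step` for these parameters.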

Memory order can be useful to ensure or hint at (it even makes some sense for sparse, as CSC and CSR are basically the same thing as F vs. C order).
Allowing order= or preferred_order= or so seems fine, but you will probably want that only as part of arr.__array_namespace__() and not as part of the end-user API? Unless you "hide" it a bit?


I like this idea—collecting the use-cases that strides have been used for and defining APIs for them instead of the strides themselves. (With 18 years of NumPy and even more for the predecessors, we must have seen all the use-cases by now.)

This originally came up in the context of the paper for SciPy—indicating this as a possible direction would satisfy the needs of the paper. (Maybe after more of the group signs off on it—or maybe you're already in agreement on this and I'm just slow to catch on.)


> FYI this issue has been discussed in the array API meeting this week 🙂 Someone will write up a summary later, so I wouldn't dare attempt to give a poor summary in my own words.

Not a summary, but one point I made there and I should probably mention here too, is that for the SciPy welch implementation, which is the benchmark we included in the paper, the difference between the strides-optimized and the non-strides, purely standard portable version was literally four orders of magnitude (0.039 seconds vs. 124 seconds on PyTorch GPU). Of course, for practicality, SciPy will keep the fast path for NumPy, CuPy, and PyTorch, but that doesn't help if some other library comes along that it doesn't know about.

5 participants: @rgommers, @seberg, @asmeurer, @jpivarski, @leofang

This discussion was converted from issue #641 on April 04, 2024 07:57.


[8]ページ先頭

©2009-2025 Movatter.jp