data-apis/array-apiPublic

NotificationsYou must be signed in to change notification settings
Fork54
Star262

APIs for discovering array structure as pertains to linear algebra operations#783

ilayn started this conversation inRelated topics

ilayn

Aug 15, 2023

· 8 comments

Return to top

Discussion options

ilayn
Aug 15, 2023

I would like to move the recent discussion from SciPyscipy/scipy#19068 to here for a more specialized treatment.

In the current incarnation of linalg functions, the most important distinction for dispatching to different low level routines are the array structures. Eigenvalues, SVD, linear solve, cholesky, LU, QR and so on all benefit from additional structure to gain performance and reduce the workload.

To that end, typically a quite permissive, fast, and more importantly, with quick-exit hatches, function is implemented to do the early O(n) investigation to discover non-finite entries, triangular, hessenberg, banded structures to report back to the dispatcher.

These type of algorithms typically go by the name of polyalgorithms and becoming quite standard. The most well-known is arguably the backslash polyalgorithm of matlab that selects the right low level implementation automatically when user providesA\b.

There is a stalled (mostly due to missing infrastructure and lack of maintainer time) attempt in SciPy regarding this (scipy/scipy#12824 for more details). However in this issue I'd like to point out that these discovery functions are missing from the array API which is going to be relevant for any library that is going to be implementing linalg functionality.

The naming can be changed but the missing functionality is mainly

scipy.linalg.bandwidthscipy.linalg.ishermitianscipy.linalg.issymmetricscipy.linalg.isfinite  # does not exist yet

Most of SciPy functions are crippled in terms of performance due to lack of the last item sincecheck_finite is a massive performance eater especially for sizes this array API is considering. Similarly, checking for all zeros and other details can be also pushed to native low level code for increased performance. The main discomfort we have as array consumers is that these functions are typically don't quick exit and consider all array entries, mostly usingnp.all(A == 0) type of shortcuts. This is pretty wasteful as the arrays get larger and provides an unconditional O(n**2) complexity even though probably the first entry is nonzero.

Hence either low-level underscored or exposed to public API these are the functionalities over SciPy we are hoping to get at in the near future. Moreover as we reduce our fortran codebase most of these will be carried over to compiled code for even more reduced overhead.

I'd like to get some feedback about what the array vendors think about this. In fact if this comes out of the box from array vendors that would in fact be much better. Since I did not see any work towards these issues, I wanted to bring the discussion here.

You must be logged in to vote

Replies: 8 comments

Comment options

rgommers
Aug 16, 2023
Maintainer

Thanks for your thoughts@ilayn! I think there are two separate topics here: matrix structure, andcheck_finite performance/overhead. I think the former is the main topic here. Let me first reply about the latter:

scipy.linalg.isfinite # does not exist yet
Most of SciPy functions are crippled in terms of performance due to lack of the last item sincecheck_finite is a massive performance eater especially for sizes this array API is considering

There is anisfinite function already:https://data-apis.org/array-api/latest/API_specification/generated/array_api.isfinite.html. This function isn'tlinalg-specific. That can be used by SciPy, so I think we're good there. For thelinalg extension in the array API standard, the one thing we should probably do is explicitly say that the behavior in the presence ofinf/nan values is undefined, because right now I think the topic isn't explicitly covered at all in the text.

You must be logged in to vote

0 replies

Comment options

rgommers
Aug 16, 2023
Maintainer

Efficient structure discovery is interesting. The first question I have is how well-defined that topic is. The utilities are easy to implement in principle perhaps, e.g.:

defissymmetric(x):returnall(x==matrix_transpose(x))

which I believe is all there is to it for exact symmetry. So to make sure: is that always what you want here, or would one need some inexact approximation to symmetry where if all elements are equal to within a certain tolerance, one chooses a symmetric solver routine? That's what is being done now in SciPy:https://github.com/scipy/scipy/blob/840a9ed0f41f84ce0bbf8c104bfbbbe93781cbdd/scipy/linalg/_cythonized_array_utils.pyx#L239-L323

It seems a little tricky to standardize that, because I imagine it's more an internal routine than something that NumPy, PyTorch et al. would want to expose directly to their end users. If so, perhaps what is needed is more a library-agnostic implementation of this function? For your purposes, it's something SciPy should be able to easily vendor, and it should be an implementation that's guaranteed to work well with PyTorch, CuPy, JAX, et al. Correct?

You must be logged in to vote

0 replies

Comment options

ilayn
Aug 16, 2023
Author

Indeed, if we have these per array vendor implementations for such operations, then probably SciPy won't be needing all that home-baked Cythonized array utilities to shave off some microseconds. Then we can rely on performant check routines of the arrays out-of-the box in a unified fashion.

There is an isfinite function already:

Yes but returns an array, not a boolean saying "there is something in there not finite". So for large arrays this is one of the issues

You must be logged in to vote

0 replies

Comment options

ilayn
Aug 16, 2023
Author

which I believe is all there is to it for exact symmetry.

Ah yes, this is one of the examples I have. So imagine you have a 5000x5000 array and the [3, 2] element is not equal to [2, 3]. This test goes all the way to scan 5000**2 entries although it is obvious that it will return false in the end. Hence this is what I mean with quick-exit hatches. These are occasionally costing the same time the actual algorithm will take. So we start injecting custom compiled codes in SciPy .

If we want to use less compiled code, array API should actually provide similar performant routines.

You must be logged in to vote

0 replies

Comment options

rgommers
Aug 16, 2023
Maintainer

Yes but returns an array, not a boolean saying "there is something in there not finite". So for large arrays this is one of the issues

Ah, okay - got it. Basically, we're missing a fusedall_isfinite or some such thing. In the regular case (all values are indeed finite), this will then still have to traverse the whole array, but it would avoid the overhead of creating and then checking the intermediate boolean array. This is a useful thing to do, I'd like to give some thought about how to best do it.

You must be logged in to vote

0 replies

Comment options

ilayn
Aug 16, 2023
Author

Yes exactly here is what we are doing currently

https://github.com/scipy/scipy/blob/main/scipy/linalg/_cythonized_array_utils.pyx#L239-L365

You must be logged in to vote

0 replies

Comment options

rgommers
Aug 18, 2023
Maintainer

I had a closer look at this:

>>>x=np.zeros((100,1000),dtype=np.float64)>>>%timeitnp.all(np.isfinite(x))24.7µs ±92.8nsperloop (mean ±std.dev.of7runs,10,000loopseach)>>>%timeitnp.isfinite(x)17.4µs ±164nsperloop (mean ±std.dev.of7runs,100,000loopseach)>>>x[2,2]=np.inf>>>%timeitnp.isfinite(x)17.3µs ±90.3nsperloop (mean ±std.dev.of7runs,100,000loopseach)

Soall(isfinite(x)) is fast, but does have a non-negligible cost. Anall_isfinite could be faster thanisfinite alone in case all values are finite (hard to check, but perhaps 50% faster?). That said, implementing it as, e.g., a loop in Cython code would not be faster probably - this SIMD implementation in NumPy sped upisfinite by 2x-3x. which is likely more than can be gained by writing a Cython loop I think:numpy/numpy#22165.

Use ofallclose for structure checks is ~100x slower, because of the naive pure Python implementation in numpy:

>>>%timeitnp.allclose(x,0.0)2.92ms ±7.55µsperloop (mean ±std.dev.of7runs,100loopseach)

My feeling is that a gain of ~2x for fused functions is a nice-to-have, but not low-hanging fruit. Things likenp.allclose, which is used in the_cythonized_array_utils.pyx linked to above, will be easier to speed up and by a much larger factor.

You must be logged in to vote

0 replies

Comment options

ilayn
Aug 18, 2023
Author

Ah I should clarify, the np.allclose is used only if the user wants an inexact comparison, that is if things can be off a little bit in the lower and upper part (hence the check for if atol or rtol is provided). In the actual tests it actually goes through the array with double loop.

Nice comparison anyways, here is my results for the code you provided at the top just to give sense of timing of my machine

>>>x=np.zeros((100,1000),dtype=np.float64)>>>%timeitnp.all(np.isfinite(x))>>>%timeitnp.isfinite(x)>>>x[2,2]=np.inf>>>%timeitnp.isfinite(x)17.9µs ±316nsperloop (mean ±std.dev.of7runs,100000loopseach)13.9µs ±320nsperloop (mean ±std.dev.of7runs,100000loopseach)13.7µs ±343nsperloop (mean ±std.dev.of7runs,100000loopseach)

I am going to use the symmetricity test since that we have already implemented in scipy.linalg

>>>A=np.random.rand(100,100)+50*np.eye(100)>>>A=A+A.T>>>%timeitnp.all(A==A.T)11µs ±23.2nsperloop (mean ±std.dev.of7runs,100000loopseach)>>>%timeitla.issymmetric(A)7.07µs ±18.1nsperloop (mean ±std.dev.of7runs,100000loopseac

OK some benefits, nothing to make noise about. But now I'll change the [1,2] element

>>>A[1,2]=0>>>%timeitla.issymmetric(A)3.36µs ±18.9nsperloop (mean ±std.dev.of7runs,100000loopseach)>>>%timeitnp.all(A==A.T)11µs ±22.8nsperloop (mean ±std.dev.of7runs,100000loopseach)

So here even though our scipy implementation is not even close to cache-friendliness, still gets 4x and that is mostly because we Cythonize the code and not really using NumPy C access speed which will take this easily to 10x levels. The trick is that it stops the test the moment it fails and does not carry all the way to the end. In other words, one evidence is enough. Similar things can be done with isfinite, nonzero structure etc. with similar quick returns.

When linalg functions start checking the arguments these small costs start to pile up and cause a lot of performance even if we manage to find to circumvent f2py and link directly to OpenBLAS etc.

You must be logged in to vote

0 replies

Movatterモバイル変換

APIs for discovering array structure as pertains to linear algebra operations#783

Uh oh!

ilaynAug 15, 2023

Replies: 8 comments

Uh oh!

rgommersAug 16, 2023 Maintainer

Uh oh!

rgommersAug 16, 2023 Maintainer

Uh oh!

ilaynAug 16, 2023 Author

Uh oh!

ilaynAug 16, 2023 Author

Uh oh!

Uh oh!

rgommersAug 16, 2023 Maintainer

Uh oh!

ilaynAug 16, 2023 Author

Uh oh!

rgommersAug 18, 2023 Maintainer

Uh oh!

ilaynAug 18, 2023 Author

Uh oh!

ilayn
Aug 15, 2023

rgommers
Aug 16, 2023
Maintainer

rgommers
Aug 16, 2023
Maintainer

ilayn
Aug 16, 2023
Author

ilayn
Aug 16, 2023
Author

rgommers
Aug 16, 2023
Maintainer

ilayn
Aug 16, 2023
Author

rgommers
Aug 18, 2023
Maintainer

ilayn
Aug 18, 2023
Author