APIs for discovering array structure as pertains to linear algebra operations (#783)
-
I would like to move the recent discussion from SciPy (scipy/scipy#19068) to here for a more specialized treatment.

In the current incarnation of linalg functions, the most important distinction for dispatching to different low-level routines is the array structure. Eigenvalues, SVD, linear solve, Cholesky, LU, QR and so on all benefit from additional structure to gain performance and reduce the workload. To that end, typically a quite permissive, fast and, more importantly, quick-exit-capable function is implemented to do an early O(n) investigation that discovers non-finite entries and triangular, Hessenberg, or banded structure, and reports back to the dispatcher. These types of algorithms typically go by the name of polyalgorithms and are becoming quite standard. The most well-known is arguably the backslash polyalgorithm of MATLAB, which selects the right low-level implementation automatically when the user solves a linear system with `\`. There is a stalled (mostly due to missing infrastructure and lack of maintainer time) attempt in SciPy regarding this (see scipy/scipy#12824 for more details).

However, in this issue I'd like to point out that these discovery functions are missing from the array API, which is going to be relevant for any library implementing linalg functionality. The naming can be changed, but the missing functionality is essentially the structure checks discussed below: exact and inexact symmetry/Hermitian checks, banded/triangular structure discovery, and a short-circuiting finiteness check. Most of SciPy's functions are crippled in terms of performance by the lack of the last item, since the available checks scan the whole array and return element-wise results with no early exit.

Hence, whether as low-level underscored helpers or exposed in the public API, these are the functionalities we are hoping to get at in SciPy in the near future. Moreover, as we reduce our Fortran codebase, most of these will be carried over to compiled code for even more reduced overhead.

I'd like to get some feedback about what the array vendors think about this. In fact, if this came out of the box from the array vendors, that would be much better. Since I did not see any work towards these issues, I wanted to bring the discussion here.
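To make the idea concrete, here is a minimal pure-Python sketch of such a quick-exit discovery probe, in the spirit of `scipy.linalg.bandwidth` (this is my illustration, not SciPy's actual implementation):

```python
import numpy as np

def bandwidth_probe(a):
    """Return (lower, upper) bandwidths of a 2-D array, widening the
    bands only when a nonzero entry is found outside them."""
    n, m = a.shape
    lower = upper = 0
    for i in range(n):
        for j in range(m):
            if a[i, j] != 0:
                if i > j:
                    lower = max(lower, i - j)
                else:
                    upper = max(upper, j - i)
        # quick exit: both bands already maximal, nothing left to learn
        if lower == n - 1 and upper == m - 1:
            break
    return lower, upper
```

A dispatcher can then route to a banded or triangular solver when the result warrants it (e.g. `upper == 0` means lower triangular), which is exactly the kind of decision the backslash polyalgorithm makes internally.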
Replies: 8 comments
-
Thanks for your thoughts @ilayn! I think there are two separate topics here: matrix structure discovery, and fast non-finiteness checking.

There is an `isfinite` function in the array API standard already, but it is element-wise.
-
Efficient structure discovery is interesting. The first question I have is how well-defined that topic is. The utilities are perhaps easy to implement in principle, e.g.:

```python
def issymmetric(x):
    return all(x == matrix_transpose(x))
```

which I believe is all there is to it for exact symmetry. So to make sure: is that always what you want here, or would one need some inexact approximation to symmetry, where if all elements are equal to within a certain tolerance, one chooses a symmetric solver routine? That's what is being done now in SciPy: https://github.com/scipy/scipy/blob/840a9ed0f41f84ce0bbf8c104bfbbbe93781cbdd/scipy/linalg/_cythonized_array_utils.pyx#L239-L323

It seems a little tricky to standardize that, because I imagine it's more an internal routine than something that NumPy, PyTorch et al. would want to expose directly to their end users. If so, perhaps what is needed is more a library-agnostic implementation of this function? For your purposes, it's something SciPy should be able to easily vendor, and it should be an implementation that's guaranteed to work well with PyTorch, CuPy, JAX, et al. Correct?
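For the inexact case, a minimal sketch of what a tolerance-based check could look like (the name and defaults here are my assumptions; SciPy's `issymmetric` accepts `atol`/`rtol` keywords for this purpose):

```python
import numpy as np

def issymmetric_approx(x, rtol=1e-12, atol=0.0):
    # allclose-style comparison of each element against its transpose
    # partner; note this still scans the full array with no early exit
    return bool(np.all(np.abs(x - x.T) <= atol + rtol * np.abs(x.T)))
```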
-
Indeed, if we had these per-array-vendor implementations for such operations, then SciPy probably wouldn't need all that home-baked Cythonized array-utility code to shave off some microseconds. Then we could rely on performant check routines from the array libraries out of the box, in a unified fashion.

Yes, but `isfinite` returns an array, not a boolean saying "there is something in there that is not finite". So for large arrays this is one of the issues.
-
Ah yes, this is one of the examples I have. So imagine you have a 5000x5000 array and the [3, 2] element is not equal to the [2, 3] element. The `np.all` test goes all the way through all 5000**2 entries, although it is obvious early on that it will return False in the end. This is what I mean by quick-exit hatches; these checks occasionally cost as much time as the actual algorithm takes. So we started injecting custom compiled code into SciPy. If we want to use less compiled code, the array API should provide similarly performant routines.
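A minimal sketch of such a quick-exit hatch for the symmetry case, in plain Python for clarity (the SciPy routine is roughly the Cythonized equivalent of this loop):

```python
def issymmetric_quick_exit(a):
    # compare mirrored off-diagonal pairs and bail out at the first
    # mismatch, instead of materializing a full boolean array
    n = a.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if a[i, j] != a[j, i]:
                return False  # one piece of evidence is enough
    return True
```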
-
Ah, okay - got it. Basically, we're missing a fused `all(isfinite(x))` that can short-circuit.
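Until such a fused kernel exists, one portable approximation is to scan in blocks, so a non-finite entry near the front of the array is caught early; a sketch (the function name and block size are mine):

```python
import numpy as np

def allfinite(a, block=8192):
    # check flattened blocks one at a time; a NaN/inf in an early block
    # avoids touching the rest of the array, though the worst case is
    # still a full scan, unlike a truly fused short-circuiting kernel
    flat = np.ravel(a)
    for start in range(0, flat.size, block):
        if not np.isfinite(flat[start:start + block]).all():
            return False
    return True
```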
-
Yes, exactly. Here is what we are doing currently: https://github.com/scipy/scipy/blob/main/scipy/linalg/_cythonized_array_utils.pyx#L239-L365
-
I had a closer look at this:

```python
>>> x = np.zeros((100, 1000), dtype=np.float64)
>>> %timeit np.all(np.isfinite(x))
24.7 µs ± 92.8 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit np.isfinite(x)
17.4 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>>> x[2, 2] = np.inf
>>> %timeit np.isfinite(x)
17.3 µs ± 90.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
```

So the `np.all` reduction adds about 40% on top of `np.isfinite` itself here.

Use of `np.allclose` is far more expensive:

```python
>>> %timeit np.allclose(x, 0.0)
2.92 ms ± 7.55 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

My feeling is that a gain of ~2x for fused functions is a nice-to-have, but not low-hanging fruit. Things like the `allclose`-based tolerance checks look like the bigger cost.
-
Ah, I should clarify: `np.allclose` is used only if the user wants an inexact comparison, that is, if things can be off a little bit in the lower and upper parts (hence the check for whether `atol` or `rtol` is provided). In the actual check it goes through the array with a double loop.

Nice comparison anyways. Here are my results for the code you provided at the top, just to give a sense of the timing on my machine:

```python
>>> x = np.zeros((100, 1000), dtype=np.float64)
>>> %timeit np.all(np.isfinite(x))
17.9 µs ± 316 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit np.isfinite(x)
13.9 µs ± 320 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> x[2, 2] = np.inf
>>> %timeit np.isfinite(x)
13.7 µs ± 343 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```

I am going to use the symmetricity test, since that is what we have already implemented in scipy.linalg:

```python
>>> A = np.random.rand(100, 100) + 50*np.eye(100)
>>> A = A + A.T
>>> %timeit np.all(A == A.T)
11 µs ± 23.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit la.issymmetric(A)
7.07 µs ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```

OK, some benefit, nothing to make noise about. But now I'll change the [1, 2] element:

```python
>>> A[1, 2] = 0
>>> %timeit la.issymmetric(A)
3.36 µs ± 18.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit np.all(A == A.T)
11 µs ± 22.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```

So here, even though our SciPy implementation is not even close to cache-friendly, it still gets a 3-4x speedup, and that is mostly because we Cythonize the code and are not really using NumPy C access speed, which would take this easily to 10x levels. The trick is that it stops the test the moment it fails and does not carry on all the way to the end; in other words, one piece of evidence is enough. Similar things can be done for isfinite, nonzero structure, etc. with similar quick returns. When linalg functions start checking their arguments, these small costs start to pile up and cause a lot of performance loss, even if we manage to circumvent f2py and link directly to OpenBLAS etc.
This discussion was converted from issue #675 on April 04, 2024 08:32.