MAINT,ENH: Reorganize buffered iteration setup #27883


Merged
mattip merged 23 commits into numpy:main from seberg:reduce-buf-refactor on Dec 17, 2024

Conversation

seberg (Member) commented Dec 1, 2024 (edited):

This changes the buffered iteration setup to do most of the work initially rather than when actually copying buffers.

It has a few changes:

  • For now, a very slightly higher initial overhead.
  • Shrinks the buffersize to fit the dimensions. This means that buffers can be re-used far more often for broadcast operations.
    (To be fair, this is most important with reductions, where the buffersize always had to be shrunk down.)
  • The buffer re-use logic is now passably understandable, I hope. Which means we may also be able to optimize it further (e.g. allow re-use for dtypes with references, which currently fails...).
  • Tries to guess whether using buffers is actually worthwhile. An example is:

        a = np.ones((4, 500))[::2]
        %timeit a + a  # 840 ns vs. 1.15

    so not sure it is important; could maybe save on it and just grow as much as possible.
    (A possible new optimization: use the "reduce" loop even if not a reduce.)
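FWIW, the buffersize-shrinking idea from the list above can be sketched in pure Python (a hypothetical simplification, not the actual C logic; the name `shrink_buffersize` is made up for illustration):

```python
def shrink_buffersize(buffersize, coresize):
    """Round the buffersize down to a whole number of cores, so that every
    buffered chunk ends exactly on a dimension boundary.  Aligned chunks are
    what makes buffer re-use possible for broadcast operands."""
    if coresize >= buffersize:
        # The core itself must be split; nothing to align here.
        return buffersize
    return (buffersize // coresize) * coresize

# A default buffersize of 8192 elements with a 500-element core is shrunk
# to 16 full cores (8000 elements):
assert shrink_buffersize(8192, 500) == 8000
```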

It still needs:

  • Figure out why CI breaks down when a local run did not.
  • (largely ok) A bit of cleanup (e.g. unnecessary moving around). The main function is long and a bit messy, I admit. But a second pair of eyes is probably better at suggesting improvements, if they are worthwhile.
  • Someone will probably want some good speedup examples. They exist, but I need to nail down the nicest ones.
  • Decide what to do about NPY_ITER_CONTIG.

seberg marked this pull request as draft on December 1, 2024 22:17

seberg (Member, Author) commented Dec 2, 2024 (edited):

A (maybe not too impressive) example of reduced per-iteration overhead is:

    a = np.ones((100000, 2, 2))
    %timeit a.sum(1)  # 3.55 ms vs. 3.9 ms

In this case, we are only benefitting from figuring out sizes up-front rather than every time. It should be possible to reduce the overhead here a lot more: the old code could not optimize the buffered iternext() for advancing the index (it could have done so for reduce ones, though).
The new code now always knows exactly which dimension to advance and by how much (unless a coreoffset is set, i.e. effectively always). So it can just advance one dimension (plus each of the next dimensions by one if there was an overflow).
Currently, it needs to figure out the full thing with divisions for each dimension.
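The "advance one dimension, carry on overflow" scheme can be sketched in pure Python (a hypothetical odometer-style simplification of the iterator's bookkeeping, assuming the step never exceeds one full innermost chunk):

```python
def advance(index, shape, step):
    """Advance a multi-index by `step` along the innermost dimension,
    carrying at most one into each next dimension on overflow, instead of
    recomputing every coordinate with per-dimension divisions."""
    index = list(index)
    index[-1] += step
    for dim in range(len(shape) - 1, 0, -1):
        if index[dim] < shape[dim]:
            break  # no overflow; done carrying
        index[dim] -= shape[dim]
        index[dim - 1] += 1
    return index

# Stepping 10 from [0, 90] in a (3, 100) iteration wraps into the next dim:
assert advance([0, 90], [3, 100], 10) == [1, 0]
```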

FWIW, some of the strided benchmarks seem to profit, but I am not exactly sure why... I'll run the full benchmark suite, although I am not too optimistic there will be good examples.

seberg (Member, Author) left a comment:

A few comments, some of which apply because looking at it in a different layout is helpful.

Some comments are too short, but this is ready for review, I think.

There is little point in reviewing in-line for the nditer_api.c. It shows that I tried to retain some things, but of course the whole point of this exercise is to effectively rewrite the copy to/from buffers.

* Whether the iteration could be done with no buffering.
*
* Note that the iterator may use buffering to increase the inner loop size
* even when buffering is not required.
seberg (Member, Author) commented:

This was always true of course. Previously, it was even always the case unless a reduction prevented it.

    npy_intp fact = NIT_ITERINDEX(iter) / NIT_BUFFERDATA(iter)->coresize;
    npy_intp offset = NIT_ITERINDEX(iter) - fact * NIT_BUFFERDATA(iter)->coresize;
    NIT_BUFFERDATA(iter)->coreoffset = offset;
}
seberg (Member, Author) commented:

Could be integrated into the loop. The bigger improvement here would be to avoid this function entirely in iternext(), which the code does for unbuffered iterators, but not for buffered ones.

mhvk (Contributor) commented:

For a next iteration, I think!

    NPY_IT_DBG_PRINT1(
            "Iterator: clearing refs of operand %d\n", (int)iop);
    npy_intp buf_stride = dtypes[iop]->elsize;
    // TODO: transfersize is too large for reductions
seberg (Member, Author) commented:

Not a change, not a catastrophe. But this always clears the buffer as if it were completely filled (even if there may only be a single item).
I guess calculating it shouldn't be too hard; we have NBF_STRIDES() and NBF_OUTERSTRIDES() to work with. (If either stride is 0, the core-size or the outer-size does not matter.)

mattip (Member) commented:

Let's leave that for a future enhancement

* NOTE: This is technically not true since we may still use
* an outer reduce at this point.
* So it prefers a non-reduce setup, which seems not
* ideal, but OK.
seberg (Member, Author) commented:

Hmmmm, maybe the if can actually just use op_single_stride_dims[iop] == idim - 1?

That would require assuming the use of reduce logic below (and potentially disabling it when it is not necessary). Probably that just works; it may even be easier to grok.

mhvk (Contributor) commented:

I'm still trying to grok this bit, but could one just swap the two pieces?

seberg (Member, Author) commented:

This might work, but not sure if it should matter right now (should probably open an issue that this all should be tweaked)...

The point is, if we use a "reduce" setup (i.e. two strides), then we can do one more dimension without buffering.

And maybe/probably it works to change this if and move it to the beginning (before the check for identical strides).
(I have to think about it myself.)

seberg (Member, Author) commented:

Talking with Matti today, so hopefully we can push it through (with some notes in an issue). So a bit of a note to self: changing it here is simple, I think. The difference will be that using_reduce needs to be discovered below by re-checking operands.
Reduce-loops are unfavorable since you need to call the inner-loop/iternext more often (but you save on buffer filling). That trade-off may already be covered by preferring larger cores, but not sure.

This whole "finding the best" is very crude and I am sure it can be improved. I just also don't want to overcomplicate it and make it slow.

mattip (Member) commented:

Let's leave that for a further investigation/PR. Maybe we can open a follow-up issue to track more cleanup and optimizations.

 * If this operand is a reduction operand and this is not the
 * outer_reduce_dim, then we must stop.
 */
if (op_itflags[iop] & NPY_OP_ITFLAG_REDUCE) {
seberg (Member, Author) commented:

Could also use ITFLAG_WRITE, which may be a bit clearer, considering that we may disable the flag later.

(We don't need the flag to be set, but right now it gets set because the earlier code does input validation anyway.)

|REDUCE OuterSize: 10
|REDUCE OuterDim: 1
|BUFFER Reduce outersize: 10
|BUFFER OuterDim: 1
seberg (Member, Author) commented:

I suppose the reduce outersize is actually also always valid now, which would be useful for advancing the iterator.
Should rename/reorganize, but that would probably add churn elsewhere too.

seberg and others added 10 commits December 2, 2024 18:10

  • This reorganizes the loop to do a larger initial setup rather than figuring things out on every iteration. That shrinks the buffersize, often to fit better with the actual size. We try to do that smartly, although the logic is different. It also means that "reduce" is not necessarily used (although it is likely that before it was also effectively unused often). This is not generally faster, but in some cases it will not use buffering unnecessarily, and it should be much better at buffer re-use (at least with smaller fixes). In those cases things will be much faster. (As of writing, the setup time seems slightly slower; that is probably not unexpected, since finding the best size is extra work and it would only amortize if there are many iterations.)
  • The old code just did bad things. But a too-large ndim would not have mattered, since the copy would have finished early anyway.
  • Also clarifies the comment, since it may be a bit confusing: the outer stride being zero means that there must be nonzero inner strides (otherwise they are all zero and we don't use reduce logic). But the inner stride being zero is not meaningful.
seberg marked this pull request as ready for review December 3, 2024 10:49
seberg (Member, Author) commented:

I think this is ready for review.

  • The tests are actually surprisingly not so bad, because I wrote this test many years ago when there was a buffer-reuse bug (and einsum covers a lot too).
  • Can probably do some improvements, but it is at the point where suggestions would be the best way to drive those.

There is one remaining question: this currently breaks NPY_ITER_CONTIG, which requests an operand to be contiguous. Re-instating it shouldn't be hard, but considering that a GitHub code search finds exactly 0 occurrences outside of NumPy, and NumPy doesn't actually use it...
I do wonder if we should just raise an error if it is set, saying that we could re-instate it if needed?

seberg (Member, Author) commented:

@mattip this code removes that weird thing where arr.sum() does multiple (unnecessary) calls to the inner-loop. I think that is fine, because the pairwise summation is light-weight enough that we don't need a limit on the recursion depth?

seberg (Member, Author) commented Dec 5, 2024 (edited):

FWIW, I have reinstated and added tests for the CONTIG flag. But it is still exactly as buggy as it was before, for now...

There was a use-after-free error, not sure yet why (or if it is even related), so saving for posterity. Nvm, I had an iterator debug print, and I suspect pytest may have inspected the operand arrays, which are invalid at the end of the test.

mhvk (Contributor) left a comment:

Just comments on the long comment for now. Will try to get going on the rest later...

* The function here finds an outer dimension and its "core" to use that
* works with reductions.
* While iterating, we will fill the buffers making sure that we:
* - Never buffer beyond the outer dimension (optimize chance of re-use).
mhvk (Contributor) commented:

What would it mean to "buffer beyond the outer dimension"? For some length N, I can imagine buffering some smaller M, but why ever M>N? I'm probably misreading this!

seberg (Member, Author) commented:

Added two examples, hopefully they help (if not for this, then hopefully to show how to think about these iterations).

The outer dimension may be iterated in chunks (not fully), so if it is say 100, and we can fit 30 of them in at a time, you reach 90. Then we just buffer 10, not 30 again, because that would increment the dimension that comes next (with a different stride).
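The chunking described here (100 elements, 30 per buffer, so chunks of 30, 30, 30, then only 10) can be sketched in pure Python; `buffered_chunks` is a hypothetical name, not a function in the PR:

```python
def buffered_chunks(outer_dim_size, per_buffer):
    """Split the (first) outer dimension into buffer-sized chunks, never
    buffering beyond its end: the last chunk is shortened rather than
    wrapping into the following dimension, which has a different stride."""
    chunks = []
    remaining = outer_dim_size
    while remaining > 0:
        chunks.append(min(per_buffer, remaining))
        remaining -= chunks[-1]
    return chunks

# The example from the discussion: at 90 of 100, only 10 more are buffered.
assert buffered_chunks(100, 30) == [30, 30, 30, 10]
```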

* - Never buffer beyond the outer dimension (optimize chance of re-use).
* - Never buffer beyond the end of the core (all but outer dimension)
* if the user manually moved the iterator to the middle of the core.
* (Not typical and if, should usually happen once.)
mhvk (Contributor) commented:

Garbled sentence! In the larger context, doesn't it just mean that if the iterator is manually set, things have to be recalculated?

seberg (Member, Author) commented:

Restructured the order and wording to make it clearer hopefully.

You always have to (re-)fill buffers when manually moved (or ranged).

The point is that I decided to just finish a single core (which could be very small). For reductions, you have to do that anyway. Otherwise, it just seems a bit simpler not to worry about what happens if you iterate 1.5 core-sizes (rather than N core-sizes).

* We don't want to figure out the complicated copies every time we fill buffers
* so here we want to find a the outer iteration dimension so that:
* - Its core size is <= the maximum buffersize if buffering is needed.
* - Allows for reductions to work (with or without reduce-mode)
mhvk (Contributor) commented:

follow grammatical structure of first item:

    - Reductions are possible (with or without reduce mode);
    - Iteration overhead is minimized. We ...

* - Allows for reductions to work (with or without reduce-mode)
* - Optimizes for iteration overhead. We estimate the total overhead with:
* `(1 + n_buffers) / size`.
* The `size` is the `min(core_size * outer_dim_size, buffersize)` and
mhvk (Contributor) commented:

Should this be:

    `(1 + n_buffers) / size`, a guess of how often we fill buffers, with
    `size = min(core_size * outer_dim_size, buffersize)`.

But I think this is not very clear; this does not look like a number (it has 1/bytes as dimension). I think what one really is trying to minimize is

    `(1 + n_buffers) * core_size * outer_dim_size / min(buffer_size, core_size * outer_dim_size)`.

In practice, since the total size is fixed, what you write does the same thing, but maybe it is clearer to write it like I have it? I.e., how about the following for the whole entry:

    - Iteration overhead is minimized. For this, we use the number of times buffers
      are filled, `(1 + n_buffers) * total_size / min(total_size, buffer_size)`
      (where `total_size = core_size * outer_dim_size`).
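The heuristic being debated here can be sketched in pure Python (a hypothetical simplification of the discussion, not the C code in the PR; `overhead_estimate` is a made-up name):

```python
def overhead_estimate(core_size, outer_dim_size, n_buffers, buffersize):
    """Estimated iteration overhead of one candidate setup: how often the
    buffers get (re)filled, scaled by how many buffers are copied per fill."""
    total = core_size * outer_dim_size
    fills = total / min(total, buffersize)  # number of buffer fills
    return (1 + n_buffers) * fills

# With the total size fixed, a setup whose chunks fit within the buffersize
# needs fewer fills, so it scores lower (better):
assert overhead_estimate(500, 16, 2, 8192) < overhead_estimate(500, 16, 2, 1000)
```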

seberg (Member, Author) commented:

OK, total_size is not actually core_size * outer_dim_size, since there may be more dimensions ("outer" here is the first "outer" dimension in a sense).

(I added the "first" in a few places, maybe it helps... It would be nice to find a different name, because "outer" is also a nice name for everything that isn't buffered, while the "outer" dimension here is just the outermost one that is still iterated as part of a buffer.
But I can't think of one right now.)

seberg (Member, Author) commented:

(completely changed/expanded the formulas, though...)

mhvk (Contributor) left a comment:

OK, a few more code comments, though none remotely serious. I cannot say that I really understand the code, so at some level am relying on tests passing. I do think your refactoring makes the code easier to follow (but I haven't really tried to understand the whole...).

NpyIter_BufferData *bufferdata = NIT_BUFFERDATA(iter);

/*
 * We check two things here, first whether the core is single strided
mhvk (Contributor) commented:

The "single strided" is a bit confusing; from the code below, I'd call it contiguous. Maybe change that? Otherwise, just add "(C contiguous)" here?

EDIT: or rather, I think this does not have to be contiguous at all, just consistent with the first stride. So, maybe add "(note that the stride of the first axis does not have to equal the element size; if it does, this is equivalent to the operand being C contiguous)".

seberg (Member, Author) commented:

Rephrased to say "first how many operand dimensions can be iterated using a single stride (all dimensions are consistent),"

Contiguity of course doesn't matter (it only matters for that slightly awkward "contiguous" flag).

    /* For clarity: op_reduce_outer_dim[iop] if set always matches. */
    assert(!op_reduce_outer_dim[iop] || op_reduce_outer_dim[iop] == outer_reduce_dim);
}
if (iop != nop) {
mhvk (Contributor) commented:

How can this be hit? I don't see a break in the for loop above...

seberg (Member, Author) commented:

Just forgot to delete this. The first version I had broke immediately on finding a reduce op. But I wanted the info for all operands, so now it's breaking later.


mhvk (Contributor) left a comment:

A few more small things, and one possible little buglet. I still do not feel I really understand everything. Best would be if someone else could have a look, but, if not, my sense would be to put this in, and continue in smaller pieces...

reduce_outeraxisdata = NIT_INDEX_AXISDATA(axisdata, reduce_outerdim);
if (itflags & NPY_ITFLAG_REDUCE) {
    npy_intp sizeof_axisdata = NIT_AXISDATA_SIZEOF(itflags, ndim, nop);
    outer_axisdata = NIT_INDEX_AXISDATA(axisdata, bufferdata->outerdim);
mhvk (Contributor) commented:

Not your problem, but, really, having macros require other variables to be defined is pretty bad form! Especially since here axisdata_incr could have made use of sizeof_axisdata.

/*
* Note that this code requires/uses the fact that there is always one
* axisdata even for ndim == 0 (i.e. no need to worry about it).
*/
mhvk (Contributor) commented:

Again, only if you are making changes, maybe add:

     * Also, it uses that the operands have already been broadcast against
     * each other (and thus may have zero strides).

seberg (Member, Author) commented:

Don't feel it fits here (also moved this comment down to best_ndim = 0).

    }
}
NIT_BUFFERDATA(iter)->buffersize = best_size;
/* Core size is 0 (unless the user applies a range explicitly). */
mhvk (Contributor) commented:

typo? "Core starts at 0 (unless..."

else {
-    /* If there's no buffering, the strides are always fixed */
+    /* If there's no buffering, the strides come from the operands. */
    memcpy(out_strides, NAD_STRIDES(axisdata0), nop * NPY_SIZEOF_INTP);
mattip (Member) commented:

A future cleanup might eliminate any internal calls to this API function and use the iter data directly.

mattip (Member) left a comment:

As far as I understood, it looks good. Might be nice to check with valgrind at some point before the next release.

mattip (Member) commented:

There are some follow-ups, but nothing really jumps out as a possible continuation. Thanks @seberg

mattip merged commit f36fa72 into numpy:main on Dec 17, 2024
65 of 66 checks passed
seberg deleted the reduce-buf-refactor branch December 17, 2024 09:46
lesteve (Contributor) commented:

Somehow we are seeing some failures in scikit-learn against numpy-dev, and a quick bisect seems to indicate this PR is the reason. Suggestions more than welcome on this! See scikit-learn/scikit-learn#30509 for more details.

seberg (Member, Author) commented:

It is possible; the buffering changed, which may change things mainly for:

  • float16, in case buffer sizes are smaller, because intermediate results may be cast back to float16 more often.
  • float32 summation is more likely to change because iteration sizes increased. That will change the pairwise summation logic (I would think slightly for the better).

(I was planning on adding a release note.)

lesteve (Contributor) commented Dec 19, 2024 (edited):

  • float32 summation is more likely to change because iteration sizes increased. That will change the pair-wise summation logic (I would think slightly for the better).

The issue does happen with float32. We are comparing our own optimized version of pairwise_distances to scipy.spatial.cdist on float64 (no, both are float32, my mistake). Before, the test passed with rtol=1e-5; now the relative difference is 2.7e-4, so it kind of looks a bit off. There is also some optimized Cython code involved which I am not very familiar with, so it could be partly due to our Cython code as well, hard to tell at this point.

I do realize that it's quite far away from a minimal reproducer ... I could try to spend time on creating one/debugging further, if you think this is useful.

scikit-learn/scikit-learn#30509 (comment) has more details just in case.

seberg (Member, Author) commented:

I am taking a look. Somewhere in the code, there must be a summation along the dimension, do you know where that happens?

Of course both directions of change are possible; in this instance, my current suspicion is that you are using a NumPy sum, which is then precise for:

    dim = 1000000
    leftover = dim % 8196  # 88 elements
    summation_change = diff[:-88].sum() + diff[-88:].sum() - diff.sum()

Where summation_change approximates the difference under the assumption that the precision is always better for the pairwise summation itself (I am not sure this is the case!).

But parts of the code also look like they possibly go to float64 intermediately, and with that I don't see this happening.

Or in other words/example:

    eps = 1e-4
    dim = 1000000
    arr = np.array([np.float32(1 + eps) - 1] * dim)  # 1 + eps - 1 isn't eps

    # NumPy 2.1.3:
    (arr**2).sum() - arr[0]**2 * dim
    # np.float32(-1.1175871e-08)

    # NumPy main:
    (arr**2).sum() - arr[0]**2 * dim
    # np.float32(1.8626451e-09)

And to complete the reasoning above, on 2.1.3:

    In [18]: (arr[:-88]**2).sum() + (arr[-88:]**2).sum() - (arr**2).sum()
    Out[18]: np.float32(0.0)

On main:

    In [66]: (arr[:-88]**2).sum() + (arr[-88:]**2).sum() - (arr**2).sum()
    Out[66]: np.float32(-9.313226e-10)
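The grouping sensitivity being discussed can be illustrated with a pure-Python pairwise summation sketch (a hypothetical simplification; NumPy's actual implementation is unrolled C with different block handling). The point is that the result depends on how the data is chunked, so changing buffer/iteration sizes can legitimately change float32 results:

```python
def pairwise_sum(data, blocksize=8):
    """Sum `data` by recursive halving down to blocks of `blocksize`
    elements.  The rounding of intermediate sums depends on the grouping,
    which is why the chunking of the iteration can shift the final value."""
    n = len(data)
    if n <= blocksize:
        return sum(data)  # small blocks are summed naively
    half = n // 2
    return pairwise_sum(data[:half], blocksize) + pairwise_sum(data[half:], blocksize)

# For exact (integer) inputs the grouping does not matter:
assert pairwise_sum(list(range(1000))) == sum(range(1000))
```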

lesteve (Contributor) commented Dec 19, 2024 (edited):

Thanks for having a look!

Somewhere in the code, there must be a summation along the dimension, do you know where that happens?

Not very familiar with this code unfortunately, maybe @jjerphan or @jeremiedbb remember more details. See scikit-learn/scikit-learn#30509 (comment) for more context and the original failure on scikit-learn when running against numpy-dev.


lesteve (Contributor) commented:

Somewhere in the code, there must be a summation along the dimension, do you know where that happens?

Actually, complete guess, but maybe in row_norms, which does the equivalent of np.sqrt((X * X).sum(axis=1)) with an np.einsum?

https://github.com/scikit-learn/scikit-learn/blob/4ad187a7401f939c4d9cd27090c4258b5d810650/sklearn/utils/extmath.py#L47-L82
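For reference, the equivalence mentioned above, as a minimal sketch (this is not scikit-learn's actual row_norms implementation, just the einsum identity it relies on):

```python
import numpy as np

# Row-wise Euclidean norms: einsum("ij,ij->i") computes the per-row sum of
# squares without materializing the X * X temporary.
X = np.array([[3.0, 4.0], [0.0, 5.0]])
norms_einsum = np.sqrt(np.einsum("ij,ij->i", X, X))
norms_naive = np.sqrt((X * X).sum(axis=1))

assert np.array_equal(norms_einsum, norms_naive)
```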


Reviewers

mhvk left review comments

mattip approved these changes


4 participants

@seberg @mattip @lesteve @mhvk
