MAINT,ENH: Reorganize buffered iteration setup #27883


Merged
mattip merged 23 commits into numpy:main from seberg:reduce-buf-refactor on Dec 17, 2024

Conversation

seberg (Member) commented Dec 1, 2024 (edited):

This changes the buffered iteration setup to do most of the work initially rather than when actually copying buffers.

It has a few changes:

  • For now, a very slightly higher initial overhead.
  • Shrinks the buffersize to fit the dimensions. This means that buffers can be re-used far more often for broadcast operations.
    (To be fair, this is most important with reductions, where the buffersize always had to be shrunk down.)
  • The buffer re-use logic is now passably understandable, I hope. Which means we may also be able to optimize it further (e.g. allow re-use for dtypes with references, which currently fails...).
  • Tries to guess whether using buffers is actually worthwhile. An example is:

        a = np.ones((4, 500))[::2]
        %timeit a + a  # 840 ns vs. 1.15

    so not sure it is important; could maybe save on it and just grow as much as possible.
    (A possible new optimization: use the "reduce" loop even if not a reduce.)
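FWIW, the buffersize-shrinking idea from the list above can be sketched in pure Python (a hypothetical simplification, not the actual C logic; the name `shrink_buffersize` is made up for illustration):

```python
def shrink_buffersize(buffersize, coresize):
    """Round the buffersize down to a whole number of cores, so that every
    buffered chunk ends exactly on a dimension boundary.  Aligned chunks are
    what makes buffer re-use possible for broadcast operands."""
    if coresize >= buffersize:
        # The core itself must be split; nothing to align here.
        return buffersize
    return (buffersize // coresize) * coresize

# A default buffersize of 8192 elements with a 500-element core is shrunk
# to 16 full cores (8000 elements):
assert shrink_buffersize(8192, 500) == 8000
```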

It still needs:

  • Figure out why CI breaks down when a local run did not.
  • (largely ok) A bit of cleanup (e.g. unnecessary moving around). The main function is long and a bit messy, I admit. But a second pair of eyes is probably better at suggesting improvements, if they are worthwhile.
  • Someone will probably want some good speedup examples. They exist, but I need to nail down the nicest ones.
  • Decide what to do about NPY_ITER_CONTIG.

seberg marked this pull request as draft on December 1, 2024 22:17

seberg (Member, Author) commented Dec 2, 2024 (edited):

A (maybe not too impressive) example of reduced per-iteration overhead is:

    a = np.ones((100000, 2, 2))
    %timeit a.sum(1)  # 3.55 ms vs. 3.9 ms

In this case, we are only benefitting from figuring out sizes up-front rather than every time. It should be possible to reduce the overhead here a lot more: the old code could not optimize the buffered iternext() for advancing the index (it could have done so for reduce ones, though).
The new code now always knows exactly which dimension to advance and by how much (unless a coreoffset is set, i.e. effectively always). So it can just advance one dimension (plus each of the next dimensions by one if there was an overflow).
Currently, it needs to figure out the full thing with divisions for each dimension.
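The "advance one dimension, carry on overflow" scheme can be sketched in pure Python (a hypothetical odometer-style simplification of the iterator's bookkeeping, assuming the step never exceeds one full innermost chunk):

```python
def advance(index, shape, step):
    """Advance a multi-index by `step` along the innermost dimension,
    carrying at most one into each next dimension on overflow, instead of
    recomputing every coordinate with per-dimension divisions."""
    index = list(index)
    index[-1] += step
    for dim in range(len(shape) - 1, 0, -1):
        if index[dim] < shape[dim]:
            break  # no overflow; done carrying
        index[dim] -= shape[dim]
        index[dim - 1] += 1
    return index

# Stepping 10 from [0, 90] in a (3, 100) iteration wraps into the next dim:
assert advance([0, 90], [3, 100], 10) == [1, 0]
```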

FWIW, some of the strided benchmarks seem to profit, but I am not exactly sure why... I'll run the full benchmark suite, although I am not too optimistic there will be good examples.

seberg (Member, Author) left a comment:

A few comments, some of which apply because looking at it in a different layout is helpful.

Some comments are too short, but this is ready for review, I think.

There is little point in reviewing in-line for the nditer_api.c. It shows that I tried to retain some things, but of course the whole point of this exercise is to effectively rewrite the copy to/from buffers.

* Whether the iteration could be done with no buffering.
*
* Note that the iterator may use buffering to increase the inner loop size
* even when buffering is not required.
seberg (Member, Author) commented:

This was always true of course. Previously, it was even always the case unless a reduction prevented it.

    npy_intp fact = NIT_ITERINDEX(iter) / NIT_BUFFERDATA(iter)->coresize;
    npy_intp offset = NIT_ITERINDEX(iter) - fact * NIT_BUFFERDATA(iter)->coresize;
    NIT_BUFFERDATA(iter)->coreoffset = offset;
}
seberg (Member, Author) commented:

Could be integrated into the loop. The bigger improvement here would be to avoid this function entirely in iternext(), which the code does for unbuffered iterators, but not for buffered ones.

mhvk (Contributor) commented:

For a next iteration, I think!

    NPY_IT_DBG_PRINT1(
            "Iterator: clearing refs of operand %d\n", (int)iop);
    npy_intp buf_stride = dtypes[iop]->elsize;
    // TODO: transfersize is too large for reductions
seberg (Member, Author) commented:

Not a change, not a catastrophe. But this always clears the buffer as if it were completely filled (even if there may only be a single item).
I guess calculating it shouldn't be too hard; we have NBF_STRIDES() and NBF_OUTERSTRIDES() to work with. (If either stride is 0, the core-size or the outer-size does not matter.)

mattip (Member) commented:

Let's leave that for a future enhancement

* NOTE: This is technically not true since we may still use
* an outer reduce at this point.
* So it prefers a non-reduce setup, which seems not
* ideal, but OK.
seberg (Member, Author) commented:

Hmmmm, maybe the if can actually just use op_single_stride_dims[iop] == idim - 1?

That would require assuming the use of reduce logic below (and potentially disabling it when it is not necessary). Probably that just works; it may even be easier to grok.

mhvk (Contributor) commented:

I'm still trying to grok this bit, but could one just swap the two pieces?

seberg (Member, Author) commented:

This might work, but not sure if it should matter right now (should probably open an issue that this all should be tweaked)...

The point is, if we use a "reduce" setup (i.e. two strides), then we can do one more dimension without buffering.

And maybe/probably it works to change this if and move it to the beginning (before the check for identical strides).
(I have to think about it myself.)

seberg (Member, Author) commented:

Talking with Matti today, so hopefully we can push it through (with some notes in an issue). So a bit of a note to self: changing it here is simple, I think. The difference will be that using_reduce needs to be discovered below by re-checking operands.
Reduce-loops are unfavorable since you need to call the inner-loop/iternext more often (but you save on buffer filling). That trade-off may already be covered by preferring larger cores, but not sure.

This whole "finding the best" is very crude and I am sure it can be improved. I just also don't want to overcomplicate it and make it slow.

mattip (Member) commented:

Let's leave that for a further investigation/PR. Maybe we can open a follow-up issue to track more cleanup and optimizations.

 * If this operand is a reduction operand and this is not the
 * outer_reduce_dim, then we must stop.
 */
if (op_itflags[iop] & NPY_OP_ITFLAG_REDUCE) {
seberg (Member, Author) commented:

Could also use ITFLAG_WRITE, which may be a bit clearer, considering that we may disable the flag later.

(We don't need the flag to be set, but right now it gets set because the earlier code does input validation anyway.)

|REDUCE OuterSize: 10
|REDUCE OuterDim: 1
|BUFFER Reduce outersize: 10
|BUFFER OuterDim: 1
seberg (Member, Author) commented:

I suppose the reduce outersize is actually also always valid now, which would be useful for advancing the iterator.
Should rename/reorganize, but that would probably add churn elsewhere too.

seberg and others added 10 commits December 2, 2024 18:10

  • This reorganizes the loop to do a larger initial setup rather than figuring things out on every iteration. That shrinks the buffersize, often to fit better with the actual size. We try to do that smartly, although the logic is different. It also means that "reduce" is not necessarily used (although it is likely that before it was also effectively unused often). This is not generally faster, but in some cases it will not use buffering unnecessarily, and it should be much better at buffer re-use (at least with smaller fixes). In those cases things will be much faster. (As of writing, the setup time seems slightly slower; that is probably not unexpected, since finding the best size is extra work and it would only amortize if there are many iterations.)
  • The old code just did bad things. But a too-large ndim would not have mattered, since the copy would have finished early anyway.
  • Also clarifies the comment, since it may be a bit confusing: the outer stride being zero means that there must be nonzero inner strides (otherwise they are all zero and we don't use reduce logic). But the inner stride being zero is not meaningful.
seberg marked this pull request as ready for review December 3, 2024 10:49
seberg (Member, Author) commented:

I think this is ready for review.

  • The tests are actually surprisingly not so bad, because I wrote this test many years ago when there was a buffer-reuse bug (and einsum covers a lot too).
  • Can probably do some improvements, but it is at the point where suggestions would be the best way to drive those.

There is one remaining question: this currently breaks NPY_ITER_CONTIG, which requests an operand to be contiguous. Re-instating it shouldn't be hard, but considering that a GitHub code search finds exactly 0 occurrences outside of NumPy, and NumPy doesn't actually use it...
I do wonder if we should just raise an error if it is set, saying that we could re-instate it if needed?

seberg (Member, Author) commented:

@mattip this code removes that weird thing where arr.sum() does multiple (unnecessary) calls to the inner-loop. I think that is fine, because the pairwise summation is light-weight enough that we don't need a limit on the recursion depth?

seberg (Member, Author) commented Dec 5, 2024 (edited):

FWIW, I have reinstated and added tests for the CONTIG flag. But it is still exactly as buggy as it was before, for now...

There was a use-after-free error, not sure yet why (or if it is even related), so saving for posterity. Nvm, I had an iterator debug print, and I suspect pytest may have inspected the operand arrays, which are invalid at the end of the test.

mhvk (Contributor) left a comment:

Just comments on the long comment for now. Will try to get going on the rest later...

* The function here finds an outer dimension and its "core" to use that
* works with reductions.
* While iterating, we will fill the buffers making sure that we:
* - Never buffer beyond the outer dimension (optimize chance of re-use).
mhvk (Contributor) commented:

What would it mean to "buffer beyond the outer dimension"? For some length N, I can imagine buffering some smaller M, but why ever M>N? I'm probably misreading this!

seberg (Member, Author) commented:

Added two examples, hopefully they help (if not for this, then hopefully to show how to think about these iterations).

The outer dimension may be iterated in chunks (not fully), so if it is say 100, and we can fit 30 of them in at a time, you reach 90. Then we just buffer 10, not 30 again, because that would increment the dimension that comes next (with a different stride).
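The chunking described here (100 elements, 30 per buffer, so chunks of 30, 30, 30, then only 10) can be sketched in pure Python; `buffered_chunks` is a hypothetical name, not a function in the PR:

```python
def buffered_chunks(outer_dim_size, per_buffer):
    """Split the (first) outer dimension into buffer-sized chunks, never
    buffering beyond its end: the last chunk is shortened rather than
    wrapping into the following dimension, which has a different stride."""
    chunks = []
    remaining = outer_dim_size
    while remaining > 0:
        chunks.append(min(per_buffer, remaining))
        remaining -= chunks[-1]
    return chunks

# The example from the discussion: at 90 of 100, only 10 more are buffered.
assert buffered_chunks(100, 30) == [30, 30, 30, 10]
```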

* - Never buffer beyond the outer dimension (optimize chance of re-use).
* - Never buffer beyond the end of the core (all but outer dimension)
* if the user manually moved the iterator to the middle of the core.
* (Not typical and if, should usually happen once.)
mhvk (Contributor) commented:

Garbled sentence! In the larger context, doesn't it just mean that if the iterator is manually set, things have to be recalculated?

seberg (Member, Author) commented:

Restructured the order and wording to make it clearer hopefully.

You always have to (re-)fill buffers when manually moved (or ranged).

The point is that I decided to just finish a single core (which could be very small). For reductions, you have to do that anyway. Otherwise, it just seems a bit simpler not to worry about what happens if you iterate 1.5 core-sizes (rather than N core-sizes).

* We don't want to figure out the complicated copies every time we fill buffers
* so here we want to find a the outer iteration dimension so that:
* - Its core size is <= the maximum buffersize if buffering is needed.
* - Allows for reductions to work (with or without reduce-mode)
mhvk (Contributor) commented:

follow grammatical structure of first item:

    - Reductions are possible (with or without reduce mode);
    - Iteration overhead is minimized. We ...

* - Allows for reductions to work (with or without reduce-mode)
* - Optimizes for iteration overhead. We estimate the total overhead with:
* `(1 + n_buffers) / size`.
* The `size` is the `min(core_size * outer_dim_size, buffersize)` and
mhvk (Contributor) commented:

Should this be:

    `(1 + n_buffers) / size`, a guess of how often we fill buffers, with
    `size = min(core_size * outer_dim_size, buffersize)`.

But I think this is not very clear; this does not look like a number (it has 1/bytes as dimension). I think what one really is trying to minimize is

    `(1 + n_buffers) * core_size * outer_dim_size / min(buffer_size, core_size * outer_dim_size)`.

In practice, since the total size is fixed, what you write does the same thing, but maybe it is clearer to write it like I have it? I.e., how about the following for the whole entry:

    - Iteration overhead is minimized. For this, we use the number of times buffers
      are filled, `(1 + n_buffers) * total_size / min(total_size, buffer_size)`
      (where `total_size = core_size * outer_dim_size`).
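The heuristic being debated here can be sketched in pure Python (a hypothetical simplification of the discussion, not the C code in the PR; `overhead_estimate` is a made-up name):

```python
def overhead_estimate(core_size, outer_dim_size, n_buffers, buffersize):
    """Estimated iteration overhead of one candidate setup: how often the
    buffers get (re)filled, scaled by how many buffers are copied per fill."""
    total = core_size * outer_dim_size
    fills = total / min(total, buffersize)  # number of buffer fills
    return (1 + n_buffers) * fills

# With the total size fixed, a setup whose chunks fit within the buffersize
# needs fewer fills, so it scores lower (better):
assert overhead_estimate(500, 16, 2, 8192) < overhead_estimate(500, 16, 2, 1000)
```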

seberg (Member, Author) commented:

OK, total_size is not actually core_size * outer_dim_size, since there may be more dimensions ("outer" here is the first "outer" dimension in a sense).

(I added the "first" in a few places, maybe it helps... It would be nice to find a different name, because "outer" is also a nice name for everything that isn't buffered, while the "outer" dimension here is just the outermost one that is still iterated as part of a buffer.
But I can't think of one right now.)

seberg (Member, Author) commented:

(completely changed/expanded the formulas, though...)

mhvk (Contributor) left a comment:

OK, a few more code comments, though none remotely serious. I cannot say that I really understand the code, so at some level am relying on tests passing. I do think your refactoring makes the code easier to follow (but I haven't really tried to understand the whole...).

NpyIter_BufferData *bufferdata = NIT_BUFFERDATA(iter);

/*
 * We check two things here, first whether the core is single strided
mhvk (Contributor) commented:

The "single strided" is a bit confusing; from the code below, I'd call it contiguous. Maybe change that? Otherwise, just add "(C contiguous)" here?

EDIT: or rather, I think this does not have to be contiguous at all, just consistent with the first stride. So, maybe add "(note that the stride of the first axis does not have to equal the element size; if it does, this is equivalent to the operand being C contiguous)".

seberg (Member, Author) commented:

Rephrased to say "first how many operand dimensions can be iterated using a single stride (all dimensions are consistent),"

Contiguity of course doesn't matter (it only matters for that slightly awkward "contiguous" flag).

    /* For clarity: op_reduce_outer_dim[iop] if set always matches. */
    assert(!op_reduce_outer_dim[iop] || op_reduce_outer_dim[iop] == outer_reduce_dim);
}
if (iop != nop) {
mhvk (Contributor) commented:

How can this be hit? I don't see a break in the for loop above...

seberg (Member, Author) commented:

Just forgot to delete this. The first version I had broke immediately on finding a reduce op. But I wanted the info for all operands, so now it's breaking later.


mhvk (Contributor) left a comment:

A few more small things, and one possible little buglet. I still do not feel I really understand everything. Best would be if someone else could have a look, but, if not, my sense would be to put this in, and continue in smaller pieces...

reduce_outeraxisdata = NIT_INDEX_AXISDATA(axisdata, reduce_outerdim);
if (itflags & NPY_ITFLAG_REDUCE) {
    npy_intp sizeof_axisdata = NIT_AXISDATA_SIZEOF(itflags, ndim, nop);
    outer_axisdata = NIT_INDEX_AXISDATA(axisdata, bufferdata->outerdim);
mhvk (Contributor) commented:

Not your problem, but, really, having macros require other variables to be defined is pretty bad form! Especially since here axisdata_incr could have made use of sizeof_axisdata.

/*
* Note that this code requires/uses the fact that there is always one
* axisdata even for ndim == 0 (i.e. no need to worry about it).
*/
mhvk (Contributor) commented:

Again, only if you are making changes, maybe add:

     * Also, it uses that the operands have already been broadcast against
     * each other (and thus may have zero strides).

seberg (Member, Author) commented:

Don't feel it fits here (also moved this comment down to best_ndim = 0).

    }
}
NIT_BUFFERDATA(iter)->buffersize = best_size;
/* Core size is 0 (unless the user applies a range explicitly). */
mhvk (Contributor) commented:

typo? "Core starts at 0 (unless..."

else {
-    /* If there's no buffering, the strides are always fixed */
+    /* If there's no buffering, the strides come from the operands. */
    memcpy(out_strides, NAD_STRIDES(axisdata0), nop * NPY_SIZEOF_INTP);
mattip (Member) commented:

A future cleanup might eliminate any internal calls to this API function and use the iter data directly.

mattip (Member) left a comment:

As far as I understood, it looks good. Might be nice to check with valgrind at some point before the next release.

mattip (Member) commented:

There are some follow-ups, but nothing really jumps out as a possible continuation. Thanks @seberg

mattip merged commit f36fa72 into numpy:main on Dec 17, 2024
65 of 66 checks passed
seberg deleted the reduce-buf-refactor branch December 17, 2024 09:46
lesteve (Contributor) commented:

Somehow we are seeing some failures in scikit-learn against numpy-dev, and a quick bisect seems to indicate this PR is the reason. Suggestions more than welcome on this! See scikit-learn/scikit-learn#30509 for more details.

seberg (Member, Author) commented:

It is possible; the buffering changed, which may change things mainly for:

  • float16, in case buffer sizes are smaller, because intermediate results may be cast back to float16 more often.
  • float32 summation is more likely to change because iteration sizes increased. That will change the pairwise summation logic (I would think slightly for the better).

(I was planning on adding a release note.)

lesteve (Contributor) commented Dec 19, 2024 (edited):

  • float32 summation is more likely to change because iteration sizes increased. That will change the pair-wise summation logic (I would think slightly for the better).

The issue does happen with float32. We are comparing our own optimized version of pairwise_distances to scipy.spatial.cdist on float64 (no, both are float32, my mistake). Before, the test passed with rtol=1e-5; now the relative difference is 2.7e-4, so it kind of looks a bit off. There is also some optimized Cython code involved which I am not very familiar with, so it could be partly due to our Cython code as well, hard to tell at this point.

I do realize that it's quite far away from a minimal reproducer ... I could try to spend time on creating one/debugging further, if you think this is useful.

scikit-learn/scikit-learn#30509 (comment) has more details just in case.

seberg (Member, Author) commented:

I am taking a look. Somewhere in the code, there must be a summation along the dimension, do you know where that happens?

Of course both directions of change are possible; in this instance, my current suspicion is that you are using a NumPy sum, which is then precise for:

    dim = 1000000
    leftover = dim % 8196  # 88 elements
    summation_change = diff[:-88].sum() + diff[-88:].sum() - diff.sum()

Where summation_change approximates the difference under the assumption that the precision is always better for the pairwise summation itself (I am not sure this is the case!).

But parts of the code also look like they possibly go to float64 intermediately, and with that I don't see this happening.

Or in other words/example:

    eps = 1e-4
    dim = 1000000
    arr = np.array([np.float32(1 + eps) - 1] * dim)  # 1 + eps - 1 isn't eps

    # NumPy 2.1.3:
    (arr**2).sum() - arr[0]**2 * dim
    # np.float32(-1.1175871e-08)

    # NumPy main:
    (arr**2).sum() - arr[0]**2 * dim
    # np.float32(1.8626451e-09)

And to complete the reasoning above, on 2.1.3:

    In [18]: (arr[:-88]**2).sum() + (arr[-88:]**2).sum() - (arr**2).sum()
    Out[18]: np.float32(0.0)

On main:

    In [66]: (arr[:-88]**2).sum() + (arr[-88:]**2).sum() - (arr**2).sum()
    Out[66]: np.float32(-9.313226e-10)
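The grouping sensitivity being discussed can be illustrated with a pure-Python pairwise summation sketch (a hypothetical simplification; NumPy's actual implementation is unrolled C with different block handling). The point is that the result depends on how the data is chunked, so changing buffer/iteration sizes can legitimately change float32 results:

```python
def pairwise_sum(data, blocksize=8):
    """Sum `data` by recursive halving down to blocks of `blocksize`
    elements.  The rounding of intermediate sums depends on the grouping,
    which is why the chunking of the iteration can shift the final value."""
    n = len(data)
    if n <= blocksize:
        return sum(data)  # small blocks are summed naively
    half = n // 2
    return pairwise_sum(data[:half], blocksize) + pairwise_sum(data[half:], blocksize)

# For exact (integer) inputs the grouping does not matter:
assert pairwise_sum(list(range(1000))) == sum(range(1000))
```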

lesteve (Contributor) commented Dec 19, 2024 (edited):

Thanks for having a look!

Somewhere in the code, there must be a summation along the dimension, do you know where that happens?

Not very familiar with this code unfortunately, maybe @jjerphan or @jeremiedbb remember more details. See scikit-learn/scikit-learn#30509 (comment) for more context and the original failure on scikit-learn when running against numpy-dev.


lesteve (Contributor) commented:

Somewhere in the code, there must be a summation along the dimension, do you know where that happens?

Actually, complete guess, but maybe in row_norms, which does the equivalent of np.sqrt((X * X).sum(axis=1)) with an np.einsum?

https://github.com/scikit-learn/scikit-learn/blob/4ad187a7401f939c4d9cd27090c4258b5d810650/sklearn/utils/extmath.py#L47-L82
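For reference, the equivalence mentioned above, as a minimal sketch (this is not scikit-learn's actual row_norms implementation, just the einsum identity it relies on):

```python
import numpy as np

# Row-wise Euclidean norms: einsum("ij,ij->i") computes the per-row sum of
# squares without materializing the X * X temporary.
X = np.array([[3.0, 4.0], [0.0, 5.0]])
norms_einsum = np.sqrt(np.einsum("ij,ij->i", X, X))
norms_naive = np.sqrt((X * X).sum(axis=1))

assert np.array_equal(norms_einsum, norms_naive)
```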


Reviewers

mhvk left review comments

mattip approved these changes


4 participants

@seberg @mattip @lesteve @mhvk
