Agg already has RGB resampling with output to RGBA built in, so we just need to correctly wire up the corresponding templates. With this RGB resampling mode, we save the extra copy from RGB to RGBA in NumPy land that was required for the previous always-RGBA resampling.

With the example from #29434, this saves 800MB of peak memory usage, ~~which actually works out to two copies of 10000x10000x4 bytes; I'm not sure where the second copy comes from.~~ I was thinking it was 4-byte RGBA, but it's actually 8-byte float64 input, so 1 copy removed as expected.

While this does pass tests, I think there may be some room for further improvement, as the input to the C++ code is A[..., :3], which should be a discontiguous view, but then the pybind11 code forces it to be contiguous, which would make another copy. There are some offset and step settings in Agg that I think we can use to avoid this copy as well, but it might require extra template expansions. I have implemented this to save another copy.
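To illustrate the view-versus-copy point above, here is a minimal NumPy sketch. It is not the PR's code: `np.ascontiguousarray` is used only as a stand-in for whatever contiguity conversion the pybind11 layer performs, and the small array shape is an arbitrary assumption.

```python
import numpy as np

# Small stand-in for the float64 RGBA image (shape chosen arbitrarily).
rgba = np.zeros((4, 5, 4), dtype=np.float64)

# Slicing off the alpha channel yields a view on the same buffer, not a copy.
rgb_view = rgba[..., :3]
shares_memory = np.shares_memory(rgba, rgb_view)

# The view still skips over the alpha values, so it is not C-contiguous.
is_contiguous = rgb_view.flags["C_CONTIGUOUS"]

# Forcing contiguity (here via NumPy, as an analogue of the C++ side)
# allocates a new, tightly packed buffer.
packed = np.ascontiguousarray(rgb_view)
made_copy = not np.shares_memory(rgba, packed)

print(rgb_view.strides, packed.strides)
```

The view keeps the 4-element pixel stride of the RGBA parent, while the packed copy has a 3-element pixel stride.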

PR checklist

"closes #0000" is in the body of the PR description to link the related issue
new and changed code is tested
[n/a] Plotting related features are demonstrated in an example
[n/a] New Features and API Changes are noted with a directive and release note
[n/a] Documentation complies with general and docstring guidelines

QuLogic added 2 commits

January 10, 2025 18:36

Remove unused pixfmt_pre_type

0e0e71c

Remove unused fixed_blender_rgba_pre

2aca299

QuLogic added status: waiting for other PR topic: images labels

Jan 11, 2025

Resample RGB images in C++

2b73e7e

QuLogic force-pushed the resample-rgb branch from 60e39ae to 2b73e7e

January 15, 2025 02:02

Avoid another copy when RGBA is resampled as RGB

769f7a4

In the case of RGBA, the RGB and A channels are resampled separately, but they are created as a view on the original to pass to the C++ code. The C++ code then copies it to a contiguous buffer, but Agg's RGB resampler supports manually stepping the RGB input by a custom stride. As this step is a template parameter, we can't handle any arbitrary array, but can special-case steps of 3 or 4 units, which should cover the common cases of RGB or RGBA-viewed-as-RGB input.
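The special-casing described in this commit message can be sketched roughly as follows. This is an illustrative Python model only, not the C++ code from the commit; `pixel_step` and `SUPPORTED_STEPS` are hypothetical names, and the input is assumed to be a 3-D array described by its byte strides and item size.

```python
# Steps (in elements between consecutive pixels) assumed to have a
# pre-instantiated fast path: 3 for packed RGB, 4 for RGBA viewed as RGB.
SUPPORTED_STEPS = (3, 4)


def pixel_step(strides, itemsize):
    """Return the per-pixel step in elements, or None if a copy is needed.

    `strides` is the (row, pixel, channel) byte-stride tuple of a 3-D array.
    """
    _row_stride, pixel_stride, channel_stride = strides
    # The channels of one pixel must sit next to each other in memory.
    if channel_stride != itemsize:
        return None
    step, remainder = divmod(pixel_stride, itemsize)
    if remainder or step not in SUPPORTED_STEPS:
        return None
    return step


# float64 image, 10 pixels wide: packed RGB, then an RGBA buffer viewed as RGB.
print(pixel_step((240, 24, 8), 8))  # packed RGB
print(pixel_step((320, 32, 8), 8))  # RGBA-viewed-as-RGB
print(pixel_step((400, 40, 8), 8))  # 5 elements per pixel: unsupported
```

Any layout that falls outside the supported steps would fall back to making a contiguous copy first, which is the behaviour the commit message describes for the general case.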

tacaswell added this to the v3.11.0 milestone

Jan 17, 2025

Member, Author

QuLogic commented Jan 23, 2025

Using memray on the example from #29434, we can find 3 major memory contributors:

1. A (800MB) copy made in ImageBase.set_data; this one is expected as we don't want to use or modify external data.
2. A (3.2GB) copy made in Colormap._get_rgba_and_mask; this is part of the colour mapping process and expected (though for the old interpolation stage, this originally happened after resampling, so was smaller.)
3. A (3.2GB) copy made in _rgb_to_rgba in order to make an RGBA array for resampling in C++ with just the RGB from point 2.

Using pyinstrument, we can find that most processing time is taken up by AxesImage._make_image, specifically _to_rgba (41.5%), _rgb_to_rgba (40.5%), and _resample (16.4%).

The first commit here removes entry 3 by applying Agg templates for RGB, but we still end up allocating a similarly sized one in _resample (to make a contiguous array), but without the alpha channel, so only 2.4GB, meaning a drop of 800MB. Instrumentation doesn't show much change in runtime.
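The sizes quoted above follow from simple arithmetic. A quick check, assuming the 10000x10000 image with 8-byte float64 elements described in this PR (`array_bytes` is a hypothetical helper, and sizes are in decimal bytes):

```python
HEIGHT = WIDTH = 10_000
FLOAT64_BYTES = 8


def array_bytes(channels, itemsize=FLOAT64_BYTES):
    """Size in bytes of a HEIGHT x WIDTH x channels array."""
    return HEIGHT * WIDTH * channels * itemsize


scalar_input = array_bytes(1)   # the original single-channel data
rgba_copy = array_bytes(4)      # full RGBA intermediate
rgb_copy = array_bytes(3)       # contiguous RGB intermediate
saving = rgba_copy - rgb_copy   # what dropping the alpha channel saves

print(scalar_input, rgba_copy, rgb_copy, saving)
```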

The second commit removes the need for that latter copy as well by applying more Agg templates for step sizes of 3 or 4. Strangely though, memray now attributes more allocations to Colorizer.to_rgba: 4.7 GiB vs 3.0 GiB before. Instrumentation does show a good improvement in runtime.

I think memray must only show the blocks that contribute to peak usage (which explains why the attribution changes around a bit after the commits). Reported peak memory usage shrank from 6.8 GiB to 5.6 GiB (17.6% decrease), and runtime (of AxesImage._make_image specifically) from 19.146s to 12.467s (34.9% decrease).
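For reference, the quoted percentages can be reproduced from the before and after figures (`percent_decrease` is a hypothetical helper name):

```python
def percent_decrease(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return (before - after) / before * 100


peak_memory_drop = percent_decrease(6.8, 5.6)     # GiB
runtime_drop = percent_decrease(19.146, 12.467)   # seconds

print(f"{peak_memory_drop:.1f}% {runtime_drop:.1f}%")
```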

There are a few other copies that I can find by inspection:

1. The initial data is 10000x10000xfloat=800M bytes, though since it's passed to imshow directly, it gets immediately discarded.
2. A copy of the input in Colormap._get_rgba_and_mask; this is so that it can be modified (with a byteswap or scaling from float to int).
3. Three masks for over/under/bad, but these are booleans, not float64.
4. A cast from float to int; this is used for the indexing into the lut.

Copies 1, 2, and 4 are 8 bytes per element, so 800M each, and copy 3 should be 3 masks x 1 byte = 300M. Many of these replace the previous version or are otherwise discarded, so I guess they don't count for memray.
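As a rough tally of these smaller intermediates, assuming 8-byte elements for the float and integer arrays and 1-byte booleans for the masks (the dictionary keys are illustrative labels, not names from the code):

```python
N_PIXELS = 10_000 * 10_000

# Bytes per element for each intermediate listed above.
intermediates = {
    "initial data (float64)": 8,
    "modifiable copy of input": 8,
    "over/under/bad masks (3 x bool)": 3 * 1,
    "integer indices into the lut": 8,
}

sizes = {name: N_PIXELS * width for name, width in intermediates.items()}
for name, size in sizes.items():
    print(f"{name}: {size / 1e6:.0f}M bytes")
```

These are upper bounds on what each step allocates; as noted above, several are freed or replaced before the peak, so they would not all be live at once.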

I believe we can remove copies 2-4 with a port of _get_rgba_and_mask to C++, in order to directly output the final RGBA without any intermediaries. I didn't implement the alpha part yet, but a quick implementation shows this will cut down peak memory usage to 4.6 GiB (which is probably as low as we can go for original, RGBA, and resampled), and runtime of _get_rgba_and_mask itself from 9.757s to 1.822s.

I haven't committed that last thing, as I didn't finish alpha processing and wasn't sure if we wanted to move it to C++, but I think the numbers indicate this is a good idea.

Member

timhoffm commented Jan 24, 2025

On a side note, Colormap._get_rgba_and_mask has the peculiar behavior that integers are used to directly index into the lut, whereas floats are scaled before lookup so that 0...1 covers the color range (#28198). We want to move away from this anyway, basically deprecate accepting ints here. But we have to decide whether we need this integer-lookup capability and if so, how to make a reasonable API for it. Thoughts welcome.
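A simplified model of the int-versus-float behaviour described here, for illustration only. `lut_index` is a hypothetical function, not Matplotlib's implementation; in particular, out-of-range values are simply clamped here, whereas the real code handles over/under/bad values separately.

```python
def lut_index(value, n_colors):
    """Map a data value to a lookup-table index (simplified sketch)."""
    if isinstance(value, int):
        # Integers index into the table directly.
        index = value
    else:
        # Floats are scaled so that 0...1 spans the whole table.
        index = int(value * n_colors)
    # Clamp to the valid range (a simplification, see the note above).
    return min(max(index, 0), n_colors - 1)


print(lut_index(1, 256))    # the integer 1 picks the second entry
print(lut_index(1.0, 256))  # the float 1.0 picks the last entry
```

The same numeric value therefore selects very different colours depending on its type, which is the peculiarity under discussion.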