
[Sprint] scatter plots are (reportedly) too slow #2156


Merged

mdboom merged 2 commits into matplotlib:master from mdboom:scatter-speed

Sep 30, 2013


Conversation

Member

mdboom commented Jul 22, 2013

@ChrisBeaumont reported in his great SciPy talk that scatter plots are too slow. We should get to the bottom of that 😉.

Contributor

ChrisBeaumont commented Jun 26, 2013

This is essentially the issue discussed here -- calling plt.plot is about 10x faster than plt.scatter on my machine. I think this is because plot copies a stamp over and over in Agg, whereas scatter has to re-draw markers at every datapoint.

It would be great if matplotlib had some way of mapping size and/or color to data, without sacrificing all of that speed! I have to think there is some middle ground between the speed of plot and the flexibility of scatter.

Member

WeatherGod commented Jun 27, 2013

Last year, John Hunter explained that he did work to optimize line
collections, which assume the same marker throughout a line, while I think
scatter makes patch collections, which can't? I don't exactly recall it
all, but he did say that some of the optimizations could be brought over to
scatter.

Ben
On Jun 26, 2013 6:18 PM, "Chris Beaumont" notifications@github.com wrote:

This is essentially the issue discussed here https://github.com/astrofrog/mpl-scatter-density/issues/2#issuecomment-2508221 -- calling
plt.plot is about 10x faster than plt.scatter on my machine. I think this
is because plot renders a stamp over and over, whereas scatter has to
re-draw markers at every datapoint.
It would be great if matplotlib had some way of mapping size and/or
color to data, without sacrificing all of that speed! I have to think there
is some middle ground between the speed of plot and the flexibility of
scatter.

Member, Author

mdboom commented Jul 22, 2013

@ChrisBeaumont: I've attached some low-hanging optimizations to scatter.

I've benchmarked this using points of different sizes (5, 25, 125 pixel diameters) and different numbers of points (100, 1000, 10000, 100000 and 1000000). Here's the baseline timings:

Greater than linear growth -- probably mostly due to memory/cache pressures with the greater number of points. Here's the relative speedups with this patch.

Three separate optimizations were found:

1. zip was being used to combine the X and Y coordinates. np.dstack is obviously much better.

2. When the points are all the same size, shape and color (etc...), it falls back to draw_markers, which draws the shape once as scanlines and then draws from those scanlines multiple times.

3. The transformation for each scatter point was being stored in a list of Transform objects. Storing the transformation matrices in a Nx3x3 array is much more efficient both to create and run through.
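To make the first and third points concrete, here is a small NumPy-only sketch. It is not the code from this PR: the array names (`offsets_zip`, `offsets_stack`, `transforms`) and the pure-scale matrices are made up for illustration, and it only shows the general idea of replacing Python-level loops and object lists with stacked arrays.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])
y = np.array([10.0, 20.0, 30.0])

# Python-level pairing: builds N tuples before NumPy ever sees the data.
offsets_zip = np.array(list(zip(x, y)))

# np.dstack on two 1-D arrays returns shape (1, N, 2); [0] gives (N, 2).
offsets_stack = np.dstack((x, y))[0]

# Per-point transforms as one N x 3 x 3 array instead of N separate objects.
# Here each matrix is a pure scale (an assumption chosen for illustration).
scales = np.array([1.0, 2.0, 3.0])
transforms = np.zeros((len(scales), 3, 3))
transforms[:, 0, 0] = scales
transforms[:, 1, 1] = scales
transforms[:, 2, 2] = 1.0

# One vectorized matmul applies every matrix to the homogeneous point (1, 1, 1).
unit = np.array([1.0, 1.0, 1.0])
scaled = transforms @ unit  # shape (N, 3)
```

Both offset arrays hold the same values; the difference is only in how much work happens in the interpreter versus inside NumPy.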

Beyond this, I haven't found much other low-hanging fruit that doesn't involve changing the output. I'll save that report for another issue.

Member, Author

mdboom commented Jul 22, 2013

The benchmark, FYI.

    import matplotlib
    matplotlib.use("Agg")
    import matplotlib.pyplot as plt
    import numpy as np
    from numpy import random
    import io
    import time
    import sys

    npoints = int(sys.argv[1])
    size = int(sys.argv[2])

    np.random.seed(0)
    xy = np.random.rand(npoints, 2)
    if size > 0:
        sizes = np.power(np.random.rand(npoints) * float(size), 2.0)
    else:
        sizes = -size

    s = time.time()
    x = plt.scatter(xy[:, 0], xy[:, 1], sizes)
    plt.savefig(io.BytesIO())
    print(time.time() - s)

@ChrisBeaumont: I'd appreciate some testing in the context of the kind of data you're seeing, just to see how it compares to this completely random and synthetic benchmark.

WeatherGod reviewed

Jul 22, 2013


src/_backend_agg.cpp

Member

WeatherGod Jul 22, 2013

Just curious, what is the purpose of the clipping change in _backend_agg.cpp? Was that a bug?

Member, Author

mdboom Jul 22, 2013

Yep. It was a bug where it would avoid drawing some legitimate things.

Contributor

ChrisBeaumont commented Jul 22, 2013

Nice! I'll play around with this.

Contributor

ChrisBeaumont commented Jul 22, 2013

My benchmarks are similar: before this PR:

And after (note the different yaxis scaling):

The comparison with plot is with `plot(x, y, 'o', ms=4)`, which produces output quite similar to default scatter.

Unfortunately, the biggest speedup in this PR (the blitting) essentially replicates what plot can do already. The compelling functionality of scatter is, IMO, the ability to map color and/or size onto data. I can envision two "medium"-hanging fruit optimizations, that might push this kind of functionality into the 10^5-6 points range:

draw_markers could accept an optional color array for each marker. I'm not positive about the pixel blending math, but I suspect it's possible to pre-render a stamp and then blit it with different tints each time? If so, this would speed up calls to scatter with constant sizes but variable colors.

Sometimes the number of combinations of colors/sizes/markers will be small, especially if they are discretized first. In these situations, scatter would benefit from caching the markers, and using the same blitting trick in draw_markers.

Do you think either is feasible? I could take a crack at them in a new PR, to earn my matplotlib/C++ merit badge
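For what it's worth, the first idea (pre-render a stamp once, then blit it with a different tint per point) can be sketched in plain NumPy. This is only an illustration of the blending math, not how Agg or draw_markers works internally: `make_disc_stamp` and `blit_tinted` are hypothetical helpers, the stamp is hard-edged (no anti-aliasing), and the marker is assumed to lie fully inside the canvas.

```python
import numpy as np

def make_disc_stamp(radius):
    """Coverage mask of a filled disc, rendered once: 1.0 inside, 0.0 outside."""
    r = int(radius)
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    return (xx ** 2 + yy ** 2 <= radius ** 2).astype(float)

def blit_tinted(canvas, stamp, cx, cy, rgb):
    """Composite the stamp, tinted with rgb, onto an (H, W, 3) float canvas.

    Uses the usual "over" rule: out = alpha * color + (1 - alpha) * background.
    Assumes the stamp fits entirely inside the canvas at (cx, cy).
    """
    r = stamp.shape[0] // 2
    region = canvas[cy - r:cy + r + 1, cx - r:cx + r + 1]  # view into canvas
    alpha = stamp[..., None]                               # (h, w, 1)
    region[...] = alpha * np.asarray(rgb, dtype=float) + (1.0 - alpha) * region

# White canvas, one stamp reused for two differently colored markers.
canvas = np.ones((11, 11, 3))
stamp = make_disc_stamp(2)
blit_tinted(canvas, stamp, 3, 3, (1.0, 0.0, 0.0))
blit_tinted(canvas, stamp, 7, 7, (0.0, 0.0, 1.0))
```

The shape is rasterized once; only the cheap multiply-and-add runs per point, which is the part that would need a custom blender in a real renderer.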

Member, Author

mdboom commented Jul 22, 2013

I think handling multiple colors in draw_markers should be possible. It probably requires implementing a custom blender for agg, though. Sounds, um, magical.

As for caching markers based on multiple sizes, that should also be possible. I had experimented with generating one stamp and then resampling it to different sizes, but, even with low-quality nearest neighbor resampling, it doesn't gain much over just rendering the vector object directly. But if you built a discrete number of sizes and used those, that should help, I suspect. We probably want some sort of option to scatter where one specifies a number of "bins" to sort the sizes into (or maybe have an "auto" option that would do something smarter). And maybe not do anything special when the number of points is small, since it will add some preprocessing overhead.
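A rough sketch of what the "bins" preprocessing might look like, using NumPy only. Nothing here is an existing scatter option: `bin_sizes`, `grouped_offsets` and the `n_bins` parameter are hypothetical names, and equal-width bins are just one possible choice.

```python
import numpy as np

def bin_sizes(sizes, n_bins=8):
    """Quantize marker sizes into n_bins equal-width levels.

    Returns (centers, labels): the representative size of each bin and,
    for every point, the index of the bin it falls into.
    """
    sizes = np.asarray(sizes, dtype=float)
    edges = np.linspace(sizes.min(), sizes.max(), n_bins + 1)
    # Digitizing against the interior edges yields labels in 0..n_bins-1.
    labels = np.digitize(sizes, edges[1:-1])
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, labels

def grouped_offsets(x, y, labels):
    """Yield (label, x_group, y_group) for each non-empty size bin."""
    for lab in np.unique(labels):
        mask = labels == lab
        yield lab, x[mask], y[mask]

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 0.0, 1.0])
sizes = np.array([1.0, 2.0, 9.0, 10.0])
centers, labels = bin_sizes(sizes, n_bins=3)
groups = list(grouped_offsets(x, y, labels))
```

Each group shares a single marker size, so it could be handed to a same-style fast path, for example one `plot(xg, yg, 'o', ms=...)` call per group. Note that scatter sizes are areas in points squared while `ms` is a diameter in points, so the conversion is roughly a square root.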

dpsanders commented Jul 24, 2013

Looks really nice!
A non-matplotlib comment: If you want to claim something about the growth compared to linear, it would be useful to add a linear growth comparison to the graph.

Contributor

ChrisBeaumont commented Jul 24, 2013

Anything with curvature in a log-log plot is definitely super-linear :)
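One way to put a number on this is the local slope in log-log space: a slope of 1 is linear growth, and anything above 1 is super-linear. The sketch below uses made-up timings (not the measurements from this PR), and `local_exponent` is a hypothetical helper.

```python
import numpy as np

def local_exponent(n, t):
    """Local slope d(log t) / d(log n); 1 means linear, > 1 super-linear."""
    return np.gradient(np.log(t), np.log(n))

# Hypothetical timings for illustration only.
n = np.array([1e2, 1e3, 1e4, 1e5, 1e6])
t_linear = 2e-6 * n          # exactly linear in n
t_super = 2e-6 * n ** 1.5    # grows like n^1.5

print(local_exponent(n, t_linear))
print(local_exponent(n, t_super))
```

Plotting the `t_linear` curve alongside the measured one would also give the linear reference line suggested above.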

dpsanders commented Jul 24, 2013

True, but even with curvature it might not have reached linear yet by
the end of the time window shown!

On Wed, Jul 24, 2013 at 8:28 AM, Chris Beaumont notifications@github.com wrote:

Anything with curvature in a log-log plot is definitely super-linear :)


mdboom mentioned this pull request

Aug 26, 2013

speedup figure rendering removal of .startswith() calls and use generato... #2289

Closed

mdboom added 2 commits

September 30, 2013 14:58

Use dstack instead of zip to combine offset arrays in scatter.

d75d39f

Store transformations on collections as an Nx3x3 array, rather than a…

b8726d0

… list of Transform objects. When the collection has the same styling and only varies by offsets, use draw_markers instead.

Member, Author

mdboom commented Sep 30, 2013

I know @ChrisBeaumont wanted to take this even further, but I wonder if there are any objections to merging this as-is (since it appears to be an improvement on all fronts anyway, even if it doesn't go as far as it could).

Contributor

ChrisBeaumont commented Sep 30, 2013

+1 on merging. I do still want to push on this more, but am mired in thesis work for the next month. In any event, that can happen in a new PR

mdboom added a commit that referenced this pull request

Sep 30, 2013

Merge pull request #2156 from mdboom/scatter-speed

379b14c

[Sprint] scatter plots are (reportedly) too slow

mdboom merged commit 379b14c into matplotlib:master

Sep 30, 2013

mdboom deleted the scatter-speed branch

August 7, 2014 13:52

tacaswell mentioned this pull request

Dec 4, 2014

Update collections.py #3888

Merged

jzwinck mentioned this pull request

Aug 18, 2017

Scatter plots are very slow when using multiple colors #9053

Open


7 participants
