Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork8.1k
DOC: normalizing histograms#27426
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
story645 left a comment• edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
There's a lot of good information here, but honestly it feels a bit overwhelming. I think adding some headings in the normalizing bins sections highlighting what you're trying to show in each subsection may help anchor the reader & honestly make it more likely they won't just skim over the thing but actually look at the section that's relevant for 'em. ETA: same w. the code honestly -> separating the plotting from the labeling code a bit more might make it easier to single out what functionality is trying to be highlighted here.
| # to make the point very obvious, consider bins that do not have the same | ||
| # spacing. By normalizing by density, we preserve the shape of the | ||
| # distribution, whereas if we do not, then the wider bins have much higher | ||
| # values than the thin bins: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
If this is supposed to be about appropriate bin choice, then separate that out into it's own example w/ a side by side of equally and irregularly spaced bins? Basically I'm reading this trying to visualize the point you're trying to make, so you may as well just visualize it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
At minimum, I had to read this a couple of times to understand what was going on because of the way that the top sentence was breaking. I'm honestly still not sure if this is what I'm supposed to parse out of this:
By normalizing by density, we preserve the shape of thedistribution. We emphasize this point using irregularly spaced bins to show that in the unnormalized example the frequencies are much more sensitive to bin width.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I've added a new section before this, as I agree this was too big a leap. Sorry - I was still playing with it, and should have marked as draft.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
No, you did mark it draft - I should have asked if draft meant ready for feedback. I mark draft as an I don't think it's ready to merge, but ready for reviews.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Sorry and ok, got that for next time. Limit of github that there isn't a third -> ready for feedback but still nascent option :/ ETA: meaning I take that review very literally as ready for final it can be merged on approval review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I think it's fine to solicit review for a draft PR, but I just use Draft to abuse CI. Maybe a bad habit, but nicer for a server farm to build the docs for me than doing it locally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
That's fair - code spaces might also be really good for your use case and faster.
I'm thinking about proposing adding a field to the PR template asking `if draft, would you like feedback?[], anything in particular" b/c I think this might be a fairly common expectation mismatch (half of us are drafts are for early round feedback while messing around and encourage folks to use it as such, half of us are drafts are messing around before feedback please don't touch) and it's one of those things I don't think folks necessarily think to communicate or ask about.
Uh oh!
There was an error while loading.Please reload this page.
981021d to3e98181Comparee00fe2e toaa52a1dCompareUh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
oscargus commentedDec 4, 2023
I spent some time reading up on this the other day, so I think this is a valuable addition! (I wanted the probability mass function it turned out.) |
a0905d4 tob1afe65Compare| # %% | ||
| # This normalization can be a little hard to interpret when just exploring the | ||
| # data. The value attached to each bar is divided by the total number of data | ||
| # points _and_ the width of the bin, and thus the values _integrate_ to one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
| # points_and_ the width of the bin, and thus the values_integrate_ to one | |
| # points*and* the width of the bin, and thus the values*integrate* to one |
we are inrst notmd here
| # e.g. (``density = counts / (sum(counts) * np.diff(bins))``), | ||
| # and (``np.sum(density * np.diff(bins)) == 1``). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
| # e.g. (``density = counts / (sum(counts) * np.diff(bins))``), | |
| # and (``np.sum(density * np.diff(bins)) == 1``). | |
| # e.g. :: | |
| # | |
| # density = counts / (sum(counts) * np.diff(bins)) | |
| # np.sum(density * np.diff(bins)) == 1 |
The in-line code snippets are hard to read the way they got wrapped, do as a code block?
tacaswell left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I left two small style comments, but this is a significant improvement even without them.
@jklymak can self-merge with or without my suggestions.
story645 commentedDec 5, 2023 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
The reason I keep hammering on scoping/chunking/headings is the same reason I'm reorganizing the contributing docs so that information is better binned -> otherwise my brain will just gloss over the content even when I'm trying really hard to figure it out because issues withworking memory are really common for folks with dyslexia, burnout, or ADHD (👋). A very very low hanging fruit way to make the docs more accessible is some anchoring through titles and headings. Nothing complicated, just tell folks what they should be focusing on/picking out from a subsection/example. Smaller/cleaner examples would be great too, but subheadings are I don't think a big enough ask to be out of scope & much easier to put in now then on a new cycle. ETA: And I'm specifically making this ask of Jody b/c he's an experienced educator and technical writer - I wouldn't necessarily make this ask in other contexts. |
story645 left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I'm requesting changes b/c the .rst up top is over indented and that needs to be fixed.
| - bin the data as you want, either with an automatically chosen number of | ||
| bins, or with fixed bin edges, | ||
| - normalize the histogram so that its integral is one, | ||
| - and assign weights to the data points, so that each data point affects the | ||
| count in its bin differently. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
| -binthedataasyouwant,eitherwithanautomaticallychosennumberof | |
| bins,orwithfixedbinedges, | |
| -normalizethehistogramsothatitsintegralisone, | |
| -andassignweightstothedatapoints,sothateachdatapointaffectsthe | |
| countinitsbindifferently. | |
| -binthedataasyouwant,eitherwithanautomaticallychosennumberof | |
| bins,orwithfixedbinedges, | |
| -normalizethehistogramsothatitsintegralisone, | |
| -andassignweightstothedatapoints,sothateachdatapointaffectsthe | |
| countinitsbindifferently. |
| fig, ax = plt.subplot_mosaic([['False', 'True']], layout='constrained') | ||
| dx = 0.1 | ||
| xbins = np.arange(-4, 4, dx) | ||
| ax['False'].hist(xdata, bins=xbins, density=False, histtype='step', label='Counts') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
For all the bin width comparisons, it's kind of hard to tell the bin widths from the 'step' type - is there a way to actually show the bins?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
With so many bins, vertical lines end up almost merging. Fewer bins, it's hard to see the normal distribution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
What about stacking as rows instead of columns so you have more horizontal space to work in?
| xbins = np.arange(-4, 4, dx) | ||
| # expected histogram: | ||
| ax['False'].plot(xpdf, pdf*1000*dx, '--', color=f'C{nn}') | ||
| ax['False'].hist(xdata, bins=xbins, density=False, histtype='step') | ||
| ax['True'].hist(xdata, bins=xbins, density=True, histtype='step', label=dx) | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Do you need both here and all three? only because the busyness makes it feel very cluttered in a way where it's hard to read off the lesson
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This is the main point of why you want to normalize, so comparing and contrasting with and without normalizing is the goal. Multiple bin sizes is to better give the reader an idea of how the bin size affects their results. Sure, it's busy, but I don't think incomprehensibly so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Maybe same as above, stacking horizontally ? or maybe thinner lines + making the histogram colors paler so that it's easier to visually distinguish?
jklymak commentedDec 6, 2023
The goal of this example is not to provide a random-access reference. That is already in the API docs. The goal is to explain carefully what the normalization options do so that we can point folks who ask why we normalize histograms the way we do towards this reference for an explanation. I consider this a relatively advanced topic, and I think breaking it into more sections or subsections would be more distracting than helpful. |
story645 commentedDec 6, 2023 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
That's mutually exclusive from its accessibility in the universal design context? I'm not saying make the text accessible to folks who don't have the mathematical background or Python knowledge to parse it, I'm asking that you make it more accessible to folks who's brains may be wired a bit different. Plenty of writing on advanced topics (and most academic papers) break things out - a great example is7 Stages in Compositionality, which is intro to category theory
Is it distracting for you? Because then it's competing needs and we should figure out if we can find something that works for both of us. The lack of sections is super distracting for me, b/c I don't know where to focus or where one part ends or the other begins. And as someone who frequently links folks to our docs, it's nice when I have a specific place to do so. Reading through it again, I'm guessing the following, which if I'm correct I don't see the harm in putting this in the document & if I'm wrong it would be helpful to have that table setting in the document.
ETA: Also I could reverse outline it b/c I've read this document a bunch of times now and reverse outlined it out of frustration b/c I was trying to understand the structure. And it's clearly well structured, and my hypothesis is folks would be more enticed to read it if they could see at a scan from the outline that it's well structured. |
| # (``density = counts / (sum(counts) * np.diff(bins))``) | ||
| # (``np.sum(density * np.diff(bins)) == 1``). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
this isn't compiling correctly.
a305177 tof2da1f0Comparestory645 commentedDec 6, 2023 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
Also I think this is a conceptual tutorial on normalizing histograms far more than a gallery example on how to use the function. Basically going by the rough distinction on the index page of: Demo-> plot type and example galleries I think this is far more usage than demo. |
jklymak commentedDec 7, 2023
Thanks for the discussion. I'm going to decline to modify this example further at this time. I think the current contribution crosses the bar of being a helpful addition. I'd suggest further changes could be follow-up PRs. |
story645 commentedDec 7, 2023 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
If that's true, then why not do it here? There's no urgency to get this in. ETA: Also why not move it to tutorials, where it'll be easier to find? |
56b4b24 tob58498dComparejklymak commentedDec 7, 2023
@story645 I do not agree with your changes. Is there a reason you are force pushing onto my branch? |
b58498d tof2da1f0Comparestory645 commentedDec 7, 2023 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
Accident so I undid it? You can rebase to drop me from the commit history but I had to rebase to drop - I was using the gh cli & didn't realize it would just push back to yours. |
jklymak commentedDec 7, 2023
I'll self merge based on#27426 (review) |
tacaswell commentedDec 7, 2023
Reviewing this was on my todo list for yesterday and this morning, sorry I did not get to it in time. |
jklymak commentedDec 7, 2023
@tacaswell, it didn't change from your previous review except for a typo. |


People often seem confused by thedensity kwarg of
hist. I don't think we should change it, but we could document better.This was similar enough to
histogram_featuresthat I removed that and added a redirect.