numpy.histogram

numpy.histogram(a, bins=10, range=None, normed=False, weights=None, density=None)

Compute the histogram of a set of data.

Parameters:

a : array_like

Input data. The histogram is computed over the flattened array.

bins : int or sequence of scalars or str, optional

If bins is an int, it defines the number of equal-width bins in the given range (10, by default). If bins is a sequence, it defines the bin edges, including the rightmost edge, allowing for non-uniform bin widths.

New in version 1.11.0.

If bins is a string from the list below, histogram will use the method chosen to calculate the optimal bin width and consequently the number of bins (see Notes for more detail on the estimators) from the data that falls within the requested range. While the bin width will be optimal for the actual data in the range, the number of bins will be computed to fill the entire range, including the empty portions. For visualisation, using the ‘auto’ option is suggested. Weighted data is not supported for automated bin size selection.

‘auto’

Maximum of the ‘sturges’ and ‘fd’ estimators. Provides good all-around performance.

‘fd’ (Freedman Diaconis Estimator)

Robust (resilient to outliers) estimator that takes into account data variability and data size.

‘doane’

An improved version of Sturges’ estimator that works better with non-normal datasets.

‘scott’

Less robust estimator that takes into account data variability and data size.

‘rice’

The estimator does not take variability into account, only data size. It commonly overestimates the number of bins required.

‘sturges’

R’s default method; it only accounts for data size. It is only optimal for Gaussian data and underestimates the number of bins for large non-Gaussian datasets.

‘sqrt’

Square root (of data size) estimator, used by Excel and other programs for its speed and simplicity.
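The three forms of bins described above can be compared side by side. This is a minimal sketch on synthetic data; the seed, the sample size, and the variable names are illustrative assumptions, and the bin counts chosen by the data-dependent estimators will differ for other inputs.

```python
import numpy as np

rng = np.random.RandomState(0)   # fixed seed, illustrative data only
a = rng.normal(size=500)

# An int gives that many equal-width bins over (a.min(), a.max()).
hist, edges = np.histogram(a, bins=8)

# A sequence is taken as the bin edges themselves; widths may differ.
hist_seq, edges_seq = np.histogram(a, bins=[-4, -1, 0, 1, 4])

# A string selects an estimator; the number of bins then follows from the data.
counts = {}
for name in ("auto", "fd", "doane", "scott", "rice", "sturges", "sqrt"):
    h, _ = np.histogram(a, bins=name)
    counts[name] = len(h)

for name in sorted(counts):
    print(name, counts[name])
```

In every case the returned edge array has one more element than the histogram itself.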

range : (float, float), optional

The lower and upper range of the bins. If not provided, range is simply (a.min(), a.max()). Values outside the range are ignored. The first element of the range must be less than or equal to the second. range affects the automatic bin computation as well. While bin width is computed to be optimal based on the actual data within range, the bin count will fill the entire range including portions containing no data.
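A short sketch of how range changes both the edges and the counts. The input values are made up for illustration.

```python
import numpy as np

a = np.array([0.5, 1.5, 2.5, 3.5, 9.0])

# Default range is (a.min(), a.max()), so the value 9.0 stretches the bins.
hist_all, edges_all = np.histogram(a, bins=4)
print(hist_all)    # [3 1 0 1]

# With an explicit range, 9.0 falls outside and is ignored.
hist_sub, edges_sub = np.histogram(a, bins=4, range=(0, 4))
print(hist_sub)    # [1 1 1 1]
print(edges_sub)   # edges run from 0 to 4 in steps of 1
```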

normed : bool, optional

This keyword is deprecated in NumPy 1.6.0 due to confusing/buggy behavior. It will be removed in NumPy 2.0.0. Use the density keyword instead. If False, the result will contain the number of samples in each bin. If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that this latter behavior is known to be buggy with unequal bin widths; use density instead.

weights : array_like, optional

An array of weights, of the same shape as a. Each value in a only contributes its associated weight towards the bin count (instead of 1). If density is True, the weights are normalized, so that the integral of the density over the range remains 1.
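As an illustration of weights, the sketch below uses made-up values and weights: each bin receives the sum of the weights of the samples that fall in it, and with density=True the weighted result still integrates to 1.

```python
import numpy as np

a = np.array([0.5, 0.5, 1.5, 2.5])
w = np.array([1.0, 2.0, 0.5, 4.0])   # one weight per sample
edges = [0, 1, 2, 3]

# Each bin holds the sum of the weights of its samples, not a sample count.
hist, bin_edges = np.histogram(a, bins=edges, weights=w)
print(hist)   # [3.  0.5 4. ]

# With density=True the weighted histogram is normalized to unit integral.
dens, _ = np.histogram(a, bins=edges, weights=w, density=True)
print(np.sum(dens * np.diff(bin_edges)))
```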

density : bool, optional

If False, the result will contain the number of samples in each bin. If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function.

Overrides the normed keyword if given.
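The density normalization can be reproduced by hand as counts divided by the total count times the bin width. The sketch below assumes made-up data and deliberately unequal bin widths to show that the values integrate to 1 without summing to 1.

```python
import numpy as np

a = np.array([0.2, 0.4, 0.6, 0.8, 3.0])
edges = np.array([0.0, 1.0, 5.0])        # bin widths 1 and 4

counts, _ = np.histogram(a, bins=edges)
dens, _ = np.histogram(a, bins=edges, density=True)

# Manual normalization: count / (total count * bin width).
manual = counts / (counts.sum() * np.diff(edges))

print(counts)                            # [4 1]
print(dens.sum())                        # not 1: this is not a mass function
print(np.sum(dens * np.diff(edges)))     # the integral over the range
```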

Returns:

hist : array

The values of the histogram. See density and weights for a description of the possible semantics.

bin_edges : array of dtype float

Return the bin edges (length(hist)+1).

Notes

Every bin except the last (righthand-most) is half-open. In other words, if bins is:

[1,2,3,4]

then the first bin is [1, 2) (including 1, but excluding 2) and the second [2, 3). The last bin, however, is [3, 4], which includes 4.
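The edge rule can be checked directly by placing one sample on every edge. The inputs below are chosen only for illustration.

```python
import numpy as np

edges = [1, 2, 3, 4]

# 1 -> [1, 2), 2 -> [2, 3), while 3 and 4 both land in the closed bin [3, 4].
hist, _ = np.histogram([1, 2, 3, 4], bins=edges)
print(hist)       # [1 1 2]

# A value above the last edge lies outside every bin and is not counted.
hist_out, _ = np.histogram([4.5], bins=edges)
print(hist_out)   # [0 0 0]
```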

New in version 1.11.0.

The methods to estimate the optimal number of bins are well founded in literature, and are inspired by the choices R provides for histogram visualisation. Note that having the number of bins proportional to n^{1/3} is asymptotically optimal, which is why it appears in most estimators. These are simply plug-in methods that give good starting points for number of bins. In the equations below, h is the binwidth and n_h is the number of bins. All estimators that compute bin counts are recast to bin width using the ptp of the data. The final bin count is obtained from np.round(np.ceil(range / h)).

‘Auto’ (maximum of the ‘Sturges’ and ‘FD’ estimators)

A compromise to get a good value. For small datasets the Sturges value will usually be chosen, while larger datasets will usually default to FD. Avoids the overly conservative behaviour of FD and Sturges for small and large datasets respectively. Switchover point is usually a.size \approx 1000.

‘FD’ (Freedman Diaconis Estimator)

h = 2 \frac{IQR}{n^{1/3}}

The binwidth is proportional to the interquartile range (IQR) and inversely proportional to cube root of a.size. Can be too conservative for small datasets, but is quite good for large datasets. The IQR is very robust to outliers.

‘Scott’

h = \sigma \sqrt[3]{\frac{24 * \sqrt{\pi}}{n}}

The binwidth is proportional to the standard deviation of the data and inversely proportional to cube root of a.size. Can be too conservative for small datasets, but is quite good for large datasets. The standard deviation is not very robust to outliers. Values are very similar to the Freedman-Diaconis estimator in the absence of outliers.

‘Rice’

n_h = 2n^{1/3}

The number of bins is only proportional to cube root of a.size. It tends to overestimate the number of bins and it does not take into account data variability.

‘Sturges’

n_h = \log _{2}n+1

The number of bins is the base 2 log of a.size, plus one. This estimator assumes normality of data and is too conservative for larger, non-normal datasets. This is the default method in R’s hist method.

‘Doane’

n_h = 1 + \log_{2}(n) + \log_{2}\left(1 + \frac{|g_1|}{\sigma_{g_1}}\right)

g_1 = mean\left[\left(\frac{x - \mu}{\sigma}\right)^3\right]

\sigma_{g_1} = \sqrt{\frac{6(n - 2)}{(n + 1)(n + 3)}}

An improved version of Sturges’ formula that produces better estimates for non-normal datasets. This estimator attempts to account for the skew of the data.

‘Sqrt’

n_h = \sqrt n

The simplest and fastest estimator. Only takes into account the data size.
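The formulas above can be evaluated by hand and compared with the bin counts np.histogram picks for the same data. This is a sketch under stated assumptions, not NumPy's internal code: the data, seed, and variable names are illustrative, the range is taken as the full span of the data, and the rounding is the round-up step described in the Notes.

```python
import numpy as np

rng = np.random.RandomState(0)          # fixed seed, illustrative data only
a = rng.normal(size=500)
n = a.size
data_range = a.max() - a.min()          # peak-to-peak range of the data

# Width-based estimators: the formula gives the bin width h directly.
q75, q25 = np.percentile(a, [75, 25])
h_fd = 2.0 * (q75 - q25) / n ** (1.0 / 3.0)
h_scott = a.std() * (24.0 * np.sqrt(np.pi) / n) ** (1.0 / 3.0)

# Count-based estimators: the formula gives the number of bins n_h.
n_rice = 2.0 * n ** (1.0 / 3.0)
n_sturges = np.log2(n) + 1.0
n_sqrt = np.sqrt(n)
g1 = np.mean(((a - a.mean()) / a.std()) ** 3)            # sample skewness
sigma_g1 = np.sqrt(6.0 * (n - 2) / ((n + 1.0) * (n + 3.0)))
n_doane = 1.0 + np.log2(n) + np.log2(1.0 + abs(g1) / sigma_g1)

# Round up to a whole number of bins.
manual = {
    "fd": int(np.ceil(data_range / h_fd)),
    "scott": int(np.ceil(data_range / h_scott)),
    "rice": int(np.ceil(n_rice)),
    "sturges": int(np.ceil(n_sturges)),
    "sqrt": int(np.ceil(n_sqrt)),
    "doane": int(np.ceil(n_doane)),
}

# Bin counts that np.histogram itself chooses for the same data.
from_numpy = {name: len(np.histogram(a, bins=name)[0]) for name in manual}

for name in sorted(manual):
    print(name, manual[name], from_numpy[name])
```

The width-based and count-based estimators meet at the same final step: a width h becomes a count through range / h, and a count n_h is already the quantity being rounded.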

Examples

>>> import numpy as np
>>> np.histogram([1, 2, 1], bins=[0, 1, 2, 3])
(array([0, 2, 1]), array([0, 1, 2, 3]))
>>> np.histogram(np.arange(4), bins=np.arange(5), density=True)
(array([ 0.25,  0.25,  0.25,  0.25]), array([0, 1, 2, 3, 4]))
>>> np.histogram([[1, 2, 1], [1, 0, 1]], bins=[0, 1, 2, 3])
(array([1, 4, 1]), array([0, 1, 2, 3]))

>>> a = np.arange(5)
>>> hist, bin_edges = np.histogram(a, density=True)
>>> hist
array([ 0.5,  0. ,  0.5,  0. ,  0. ,  0.5,  0. ,  0.5,  0. ,  0.5])
>>> hist.sum()
2.4999999999999996
>>> np.sum(hist * np.diff(bin_edges))
1.0

New in version 1.11.0.

Automated Bin Selection Methods example, using 2 peak random data with 2000 points:

>>> import matplotlib.pyplot as plt
>>> rng = np.random.RandomState(10)  # deterministic random data
>>> a = np.hstack((rng.normal(size=1000),
...                rng.normal(loc=5, scale=2, size=1000)))
>>> plt.hist(a, bins='auto')  # arguments are passed to np.histogram
>>> plt.title("Histogram with 'auto' bins")
>>> plt.show()

[Figure: Histogram with 'auto' bins]

  • © Copyright 2008-2009, The Scipy community.
  • Last updated on Jun 10, 2017.
  • Created using Sphinx 1.5.3.
