
Brier score

From Wikipedia, the free encyclopedia
Measure of the accuracy of probabilistic predictions

The Brier score is a strictly proper scoring rule that measures the accuracy of probabilistic predictions. For unidimensional predictions, it is strictly equivalent to the mean squared error as applied to predicted probabilities.

The Brier score is applicable to tasks in which predictions must assign probabilities to a set of mutually exclusive discrete outcomes or classes. The set of possible outcomes can be either binary or categorical in nature, and the probabilities assigned to this set of outcomes must sum to one (where each individual probability is in the range of 0 to 1). It was proposed by Glenn W. Brier in 1950.[1]

The Brier score can be thought of as a cost function. More precisely, across all items $i \in \{1, \dots, N\}$ in a set of $N$ predictions, the Brier score measures the mean squared difference between:

  • The predicted probability assigned to the possible outcomes for item $i$
  • The actual outcome $o_i$

Therefore, the lower the Brier score is for a set of predictions, the better the predictions are calibrated. Note that the Brier score, in its most common formulation, takes on a value between zero and one, since this is the square of the largest possible difference between a predicted probability (which must be between zero and one) and the actual outcome (which can take on values of only 0 or 1). In the original (1950) formulation of the Brier score, the range is double, from zero to two.

The Brier score is appropriate for binary and categorical outcomes that can be structured as true or false, but it is inappropriate for ordinal variables which can take on three or more values.

Definition


The most common formulation of the Brier score is

$$BS = \frac{1}{N}\sum_{t=1}^{N}(f_t - o_t)^2$$

in which $f_t$ is the probability that was forecast, $o_t$ the actual outcome of the event at instance $t$ ($0$ if it does not happen and $1$ if it does happen) and $N$ is the number of forecasting instances. In effect, it is the mean squared error of the forecast. This formulation is mostly used for binary events (for example "rain" or "no rain"). The above equation is a proper scoring rule only for binary events; if a multi-category forecast is to be evaluated, then the original definition given by Brier below should be used.
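As a sketch, the binary formulation can be computed directly in plain Python (the forecast and outcome lists here are illustrative, not from the article):

```python
# Binary Brier score: mean squared difference between forecast
# probabilities f_t and observed outcomes o_t (1 = event occurred).
def brier_score(forecasts, outcomes):
    n = len(forecasts)
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / n

# Three rain forecasts versus what actually happened:
print(brier_score([1.0, 0.7, 0.3], [1, 1, 0]))  # (0 + 0.09 + 0.09) / 3
```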

Example


Suppose that one is forecasting the probability $P$ that it will rain on a given day. Then the Brier score is calculated as follows:

  • If the forecast is 100% ($P$ = 1) and it rains, then the Brier score is 0, the best score achievable.
  • If the forecast is 100% and it does not rain, then the Brier score is 1, the worst score achievable.
  • If the forecast is 70% ($P$ = 0.70) and it rains, then the Brier score is (0.70 − 1)² = 0.09.
  • In contrast, if the forecast is 70% ($P$ = 0.70) and it does not rain, then the Brier score is (0.70 − 0)² = 0.49.
  • Similarly, if the forecast is 30% ($P$ = 0.30) and it rains, then the Brier score is (0.30 − 1)² = 0.49.
  • If the forecast is 50% ($P$ = 0.50), then the Brier score is (0.50 − 1)² = (0.50 − 0)² = 0.25, regardless of whether it rains.
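The bullet values above reduce to a one-line computation per forecast, sketched here with the same (P, outcome) pairs:

```python
# Each single-forecast Brier score is just (P - outcome)^2.
cases = [(1.0, 1), (1.0, 0), (0.7, 1), (0.7, 0), (0.3, 1), (0.5, 1), (0.5, 0)]
scores = [round((p - rained) ** 2, 2) for p, rained in cases]
print(scores)  # [0.0, 1.0, 0.09, 0.49, 0.49, 0.25, 0.25]
```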

Original definition by Brier


Although the above formulation is the most widely used, the original definition by Brier[1] is applicable to multi-category forecasts as well and remains a proper scoring rule, while the binary form (as used in the examples above) is proper only for binary events. For binary forecasts, the original formulation of Brier's "probability score" has twice the value of the score currently known as the Brier score.

$$BS = \frac{1}{N}\sum_{t=1}^{N}\sum_{i=1}^{R}(f_{ti} - o_{ti})^2$$

in which $R$ is the number of possible classes in which the event can fall, and $N$ the overall number of instances of all classes. $f_{ti}$ is the predicted probability for class $i$, and $o_{ti}$ is $1$ if instance $t$ belongs to class $i$ and $0$ otherwise. For the case Rain / No rain, $R = 2$, while for the forecast Cold / Normal / Warm, $R = 3$.
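A sketch of the original multi-category form, with forecasts as probability vectors and outcomes one-hot encoded (the Cold / Normal / Warm numbers below are illustrative):

```python
# Brier's original score: squared error summed over all R classes,
# averaged over the N instances.
def brier_score_multi(forecasts, outcomes):
    return sum(
        sum((f - o) ** 2 for f, o in zip(f_vec, o_vec))
        for f_vec, o_vec in zip(forecasts, outcomes)
    ) / len(forecasts)

# Cold / Normal / Warm (R = 3): one confident correct day, one uncertain wrong day.
forecasts = [[0.8, 0.1, 0.1], [0.2, 0.5, 0.3]]
outcomes  = [[1, 0, 0],       [0, 0, 1]]
print(round(brier_score_multi(forecasts, outcomes), 2))  # 0.42
```

For binary events this form yields exactly twice the common formulation above, which matches the original range of zero to two.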

Decompositions


There are several decompositions of the Brier score which provide deeper insight into the behavior of a binary classifier.

3-component decomposition


The Brier score can be decomposed into 3 additive components: Uncertainty, Reliability, and Resolution. (Murphy 1973)[2]

$$BS = REL - RES + UNC$$

Each of these components can be decomposed further according to the number of possible classes in which the event can fall. Abusing the equality sign:

$$BS = \frac{1}{N}\sum_{k=1}^{K} n_k \left(\mathbf{f}_k - \bar{\mathbf{o}}_k\right)^2 - \frac{1}{N}\sum_{k=1}^{K} n_k \left(\bar{\mathbf{o}}_k - \bar{\mathbf{o}}\right)^2 + \bar{\mathbf{o}}\left(1 - \bar{\mathbf{o}}\right)$$

With $N$ being the total number of forecasts issued, $K$ the number of unique forecasts issued, $\bar{\mathbf{o}} = \sum_{t=1}^{N}\mathbf{o}_t / N$ the observed climatological base rate for the event to occur, $n_k$ the number of forecasts with the same probability category and $\bar{\mathbf{o}}_k$ the observed frequency, given forecasts of probability $\mathbf{f}_k$. The bold notation in the above formula indicates vectors, which is another way of denoting the original definition of the score and decomposing it according to the number of possible classes in which the event can fall. For example, a 70% chance of rain and an occurrence of no rain are denoted as $\mathbf{f} = (0.3, 0.7)$ and $\mathbf{o} = (1, 0)$ respectively. Operations like the square and multiplication on these vectors are understood to be componentwise. The Brier score is then the sum of the resulting vector on the right-hand side.
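A sketch of the decomposition for binary forecasts, grouping instances by distinct forecast probability (the data is illustrative; the identity is exact when forecasts take finitely many values):

```python
from collections import defaultdict

def murphy_decomposition(forecasts, outcomes):
    """Return (REL, RES, UNC) such that BS = REL - RES + UNC."""
    n = len(forecasts)
    o_bar = sum(outcomes) / n                      # climatological base rate
    groups = defaultdict(list)                     # outcomes per unique forecast value
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)
    rel = sum(len(v) * (f - sum(v) / len(v)) ** 2 for f, v in groups.items()) / n
    res = sum(len(v) * (sum(v) / len(v) - o_bar) ** 2 for v in groups.values()) / n
    unc = o_bar * (1 - o_bar)
    return rel, res, unc

forecasts = [0.8] * 5 + [0.2] * 5
outcomes  = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
rel, res, unc = murphy_decomposition(forecasts, outcomes)
bs = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)
print(rel)                                   # 0.0: these forecasts are perfectly reliable
print(abs(bs - (rel - res + unc)) < 1e-9)    # True: the decomposition recovers BS
```

Here the 80% forecasts verified 4 times out of 5 and the 20% forecasts once out of 5, so the reliability term vanishes, as in the example in the Reliability subsection below.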

Reliability


The reliability term measures how close the forecast probabilities are to the true probabilities, given that forecast. Note that reliability runs in the opposite direction to the everyday English sense of the word: a reliability of 0 means the forecast is perfectly reliable. For example, if we group all forecast instances where an 80% chance of rain was forecast, we get perfect reliability only if it rained 4 out of 5 times after such a forecast was issued.

Resolution


The resolution term measures how much the conditional probabilities given by the different forecasts differ from the climatic average. The higher this term is, the better. In the worst case, when the climatic probability is always forecast, the resolution is zero. In the best case, when the conditional probabilities are zero and one, the resolution is equal to the uncertainty.

Uncertainty


The uncertainty term measures the inherent uncertainty in the outcomes of the event. For binary events, it is at a maximum when each outcome occurs 50% of the time, and is minimal (zero) if an outcome always occurs or never occurs.

Two-component decomposition


An alternative (and related) decomposition generates two terms instead of three.

$$BS = CAL + REF$$
$$BS = \frac{1}{N}\sum_{k=1}^{K} n_k \left(\mathbf{f}_k - \bar{\mathbf{o}}_k\right)^2 + \frac{1}{N}\sum_{k=1}^{K} n_k \, \bar{\mathbf{o}}_k \left(1 - \bar{\mathbf{o}}_k\right)$$

The first term is known as calibration (and can be used as a measure of calibration, see statistical calibration), and is equal to reliability. The second term is known as refinement; it is an aggregation of resolution and uncertainty, and is related to the area under the ROC curve.
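A sketch of the two-term split, using the same grouping-by-forecast idea (illustrative data; the calibration term coincides with the reliability term above):

```python
from collections import defaultdict

def cal_ref_decomposition(forecasts, outcomes):
    """Return (CAL, REF) such that BS = CAL + REF for binary forecasts."""
    n = len(forecasts)
    groups = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)
    cal = sum(len(v) * (f - sum(v) / len(v)) ** 2 for f, v in groups.items()) / n
    ref = sum(len(v) * (sum(v) / len(v)) * (1 - sum(v) / len(v)) for v in groups.values()) / n
    return cal, ref

forecasts = [0.8] * 5 + [0.2] * 5
outcomes  = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
cal, ref = cal_ref_decomposition(forecasts, outcomes)
bs = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)
print(abs(bs - (cal + ref)) < 1e-9)  # True: calibration plus refinement recovers BS
```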

The Brier Score, and the CAL + REF decomposition, can be represented graphically through the so-called Brier Curves,[3] where the expected loss is shown for each operating condition. This makes the Brier Score a measure of aggregated performance under a uniform distribution of class asymmetries.[4]

Brier skill score (BSS)


A skill score for a given underlying score is an offset and (negatively) scaled variant of the underlying score, defined such that a skill score of zero means that the predictions are merely as good as a set of baseline, reference, or default predictions, while a skill score of one (100%) represents the best possible score. A skill score below zero means that the performance is even worse than that of the baseline or reference predictions. When the underlying score is the Brier score (BS), the Brier skill score (BSS) is calculated as

$$BSS = 1 - \frac{BS}{BS_{ref}}$$

where $BS_{ref}$ is the Brier score of the reference or baseline predictions which we seek to improve on. While the reference predictions could in principle be given by any pre-existing model, by default one can use the naïve model that predicts the overall proportion or frequency of a given class in the data set being scored, as the constant predicted probability of that class occurring in each instance. This baseline represents a "no skill" model that one seeks to improve on. Skill scores originate in the meteorological prediction literature, where the naïve default reference predictions are called the "in-sample climatology" predictions: climatology means a long-term or overall average of weather predictions, and in-sample means as calculated from the present data set being scored.[5][6] In this default case, for binary (two-class) classification, the reference Brier score is given by (using the notation of the first equation of this article, at the top of the Definition section):

$$BS_{ref} = \frac{1}{N}\sum_{t=1}^{N}(\bar{o} - o_t)^2$$

where $\bar{o}$ is simply the average actual outcome, i.e. the overall proportion of true class 1 in the data set:

$$\bar{o} = \frac{1}{N}\sum_{t=1}^{N} o_t.$$

With a Brier score, lower is better (it is a loss function) with 0 being the best possible score. But with a Brier skill score, higher is better with 1 (100%) being the best possible score.

The Brier skill score can be more interpretable than the Brier score because the BSS is simply the percentage improvement in the BS compared to the reference model, and a negative BSS means the model performs even worse than the reference model, which may not be obvious from the Brier score alone. However, a BSS near 100% should not typically be expected, because this would require every probability prediction to be nearly 0 or 1 (and correct, of course).
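A sketch of the BSS against the in-sample climatology baseline described above (the forecast values are illustrative):

```python
def brier_skill_score(forecasts, outcomes):
    """BSS = 1 - BS / BS_ref, where BS_ref comes from the constant
    climatology forecast o_bar (the overall event frequency)."""
    n = len(forecasts)
    bs = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / n
    o_bar = sum(outcomes) / n
    bs_ref = sum((o_bar - o) ** 2 for o in outcomes) / n
    return 1 - bs / bs_ref

# A fairly sharp, well-calibrated set of forecasts beats climatology by ~81%:
print(round(brier_skill_score([0.9, 0.8, 0.7, 0.3, 0.2, 0.1], [1, 1, 1, 0, 0, 0]), 3))  # 0.813
```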

Although the Brier score is a strictly proper scoring rule, the BSS is not strictly proper: indeed, skill scores are generally non-proper even if the underlying scoring rule is proper.[7] Still, Murphy (1973)[8] proved that the BSS is asymptotically proper as the number of samples grows large.

Classification's (probability estimation's) BSS relates to its BS as regression's coefficient of determination ($R^2$) relates to its mean squared error (MSE).

Shortcomings


The Brier score becomes inadequate for very rare (or very frequent) events, because it does not sufficiently discriminate between small changes in forecast that are significant for rare events.[9] Wilks (2010) has found that "[q]uite large sample sizes, i.e. n > 1000, are required for higher-skill forecasts of relatively rare events, whereas only quite modest sample sizes are needed for low-skill forecasts of common events."[10]

Penalized Brier Score


Although the Brier score is a strictly proper scoring rule, minimized in expectation when the forecast distribution matches the true distribution, it does not satisfy the stronger "superior" property that every correct forecast must score strictly better than any incorrect one. Ahmadian et al. (2024) formally showed that, in an $R$-class classification setting, the maximum possible Brier score for a correct prediction is $(R-1)/R$.[11] Consequently, an incorrect but confident forecast can receive a better (lower) Brier score than a correct but uncertain one.

To address this, the authors introduced the Penalized Brier Score (PBS), which adds a fixed penalty of $(R-1)/R$ whenever the predicted class does not match the true class, i.e., when $\arg\max p \neq \arg\max y$.[11] This ensures that all correct predictions are scored strictly better than any incorrect ones, satisfying the "superior" criterion. Despite this additional term, PBS retains strict propriety as defined by Gneiting and Raftery (2007): the expected PBS is uniquely minimized when the forecast equals the true distribution.[7] The modified Brier score with the penalty term, the Penalized Brier Score, can be expressed as:

$$PBS = \frac{1}{N}\sum_{t=1}^{N}\sum_{i=1}^{R}(f_{ti} - o_{ti})^2 + \begin{cases} \frac{R-1}{R} & f \in \xi \\ 0 & \text{otherwise} \end{cases}$$

where $\xi$ denotes the set of forecasts whose predicted class ($\arg\max f$) does not match the true class.

Experimental results suggest that PBS correlates more strongly with classification accuracy and F1 score during training, improving model selection via early stopping and checkpointing.[11] PBS thus addresses a key shortcoming of the original Brier score: its inability to always reward correct predictions more than incorrect ones.
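The ranking failure and its fix can be sketched directly (the probability vectors below are illustrative; membership in $\xi$ is checked via argmax):

```python
def multi_brier(f_vec, o_vec):
    # Per-instance multi-class Brier score.
    return sum((f - o) ** 2 for f, o in zip(f_vec, o_vec))

def penalized_brier(f_vec, o_vec):
    # Add (R-1)/R whenever the predicted class misses the true class.
    r = len(f_vec)
    wrong = f_vec.index(max(f_vec)) != o_vec.index(max(o_vec))
    return multi_brier(f_vec, o_vec) + ((r - 1) / r if wrong else 0.0)

truth             = [1, 0, 0]
correct_uncertain = [0.34, 0.33, 0.33]  # right class, but barely
wrong_but_close   = [0.40, 0.41, 0.19]  # wrong class, yet small squared error

# The plain Brier score ranks the incorrect forecast better...
print(multi_brier(wrong_but_close, truth) < multi_brier(correct_uncertain, truth))          # True
# ...while the penalty restores the intended ordering.
print(penalized_brier(correct_uncertain, truth) < penalized_brier(wrong_but_close, truth))  # True
```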


References

  1. ^a b Brier, G. W. (1950). "Verification of Forecasts Expressed in Terms of Probability" (PDF). Monthly Weather Review. 78 (1): 1–3. Bibcode:1950MWRv...78....1B. doi:10.1175/1520-0493(1950)078<0001:vofeit>2.0.co;2. S2CID 122906757. Archived from the original (PDF) on 2017-10-23.
  2. ^ Murphy, A. H. (1973). "A new vector partition of the probability score". Journal of Applied Meteorology. 12 (4): 595–600. Bibcode:1973JApMe..12..595M. doi:10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2.
  3. ^ Hernandez-Orallo, J.; Flach, P. A.; Ferri, C. (2011). "Brier curves: a new cost-based visualisation of classifier performance" (PDF). Proceedings of the 28th International Conference on Machine Learning (ICML-11). pp. 585–592.
  4. ^ Hernandez-Orallo, J.; Flach, P. A.; Ferri, C. (2012). "A unified view of performance metrics: translating threshold choice into expected classification loss" (PDF). Journal of Machine Learning Research. 13: 2813–2869.
  5. ^ Ferro, C. A. T.; Fricker, T. E. (2012). "A bias-corrected decomposition of the Brier score" (Notes and Correspondence). Quarterly Journal of the Royal Meteorological Society. 138 (668): 1954–1960.
  6. ^ Bowler, Neill; Dando, Marie; Beare, Sarah; Mylne, Ken. "Numerical Weather Prediction: The MOGREPS short-range ensemble prediction system: Verification report: Trial Performance of MOGREPS: January 2006 – March 2007". Forecasting Research Technical Report No. 503.
  7. ^a b Gneiting, Tilmann; Raftery, Adrian E. (2007). "Strictly Proper Scoring Rules, Prediction, and Estimation" (PDF). Journal of the American Statistical Association. 102 (447): 359–378. doi:10.1198/016214506000001437. S2CID 1878582.
  8. ^ Murphy, A. H. (1973). "Hedging and skill scores for probability forecasts". Journal of Applied Meteorology. 12: 215–223.
  9. ^ Benedetti, Riccardo (2010). "Scoring Rules for Forecast Verification". Monthly Weather Review. 138 (1): 203–211. Bibcode:2010MWRv..138..203B. doi:10.1175/2009MWR2945.1.
  10. ^ Wilks, D. S. (2010). "Sampling distributions of the Brier score and Brier skill score under serial dependence". Quarterly Journal of the Royal Meteorological Society. 136 (1): 2109–2118. Bibcode:2010QJRMS.136.2109W. doi:10.1002/qj.709. S2CID 121504347.
  11. ^a b c Ahmadian, Rouhollah; Ghatee, Mehdi; Wahlström, Johan (2025). "Superior scoring rules for probabilistic evaluation of single-label multi-class classification tasks". International Journal of Approximate Reasoning. 182: 109421. arXiv:2407.17697. doi:10.1016/j.ijar.2025.109421. ISSN 0888-613X.
