# HEPEX Science and Challenges: Verification of Ensemble Forecasts (2/4)

*Contributed by James Brown and Julie Demargne*

**What are the different attributes of forecast quality?**

Forecast quality has many aspects, or “attributes”. Particular verification metrics measure them differently, but they share common origins in the joint probability distribution of the forecasts and observations (see below). Attributes of forecast quality include (Wilks, 2006; Jolliffe and Stephenson, 2003):

**Bias** of the single-valued estimate from the forecast (or first order bias, overall bias, unconditional bias), to measure its agreement with the observed outcome on average.

**Accuracy** of the forecast, measured by the difference between the forecast and the observation, which is the forecast error.

**Correlation** of the single-valued estimate from the forecast, to describe its linear relationship with observations.

**Skill**, to estimate whether the verified forecast is of higher or lower quality than a given reference forecast. Skill requires the selection of one verification metric and one reference forecast, which could be climatology, persistence or a baseline forecast. This is particularly important when comparing forecast systems across different hydroclimatic regimes or when establishing improvements in forecast systems.

**Reliability**, or Type-I conditional bias, to describe how well the forecasts agree with the observations within one or more subsamples of the verification data, each subsample being defined by the forecasts. It is relative to the conditional distribution of the observations given the forecasts. For example, a flood ensemble forecast system is reliable, or conditionally unbiased in its forecast probabilities, if flooding is observed 20% of the time when it is forecast with probability 0.2 (the evaluation being repeated for all forecast probabilities). Note that, when conditioning on the observed variable instead, the resulting bias is known as “Type-II conditional bias.”

**Resolution**, to describe the ability of the forecast to sort a set of observed events into subsets with different frequency distributions. It is also relative to the conditional distribution of the observations given the forecasts. A flood ensemble forecast system has resolution if small changes in the forecast probabilities are associated with different observed outcomes, whether or not the forecast probabilities are reliable.

**Discrimination**, to describe whether the forecast system can discriminate between events and non-events. It is relative to the conditional distribution of the forecasts given the observations. It helps answer questions like: if the observations are in the flood level category, what did the forecasts predict? An ensemble forecast system is discriminatory with respect to a given flood threshold if it consistently forecasts the (observed) flood occurrence with a probability higher than chance (i.e., climatology) and consistently forecasts the (observed) non-flood event with a probability lower than chance.

**Sharpness** is an attribute of the forecasts alone and measures the tendency to predict with extreme probabilities (close to 0 or 1). A high degree of sharpness is desirable only when the other attributes also improve upon climatology, which is itself an unsharp forecast. For example, without reliability, sharp forecasts are misleading.
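The first four attributes (bias, accuracy, correlation and skill) can be illustrated for single-valued forecasts with a minimal Python sketch. The data values, the function names and the choice of sample climatology as the reference forecast are assumptions made for illustration only; they are not taken from any particular verification package.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def mean_error(forecasts, observations):
    """Unconditional bias: average of (forecast - observation)."""
    return mean([f - o for f, o in zip(forecasts, observations)])

def mean_absolute_error(forecasts, observations):
    """Accuracy: average magnitude of the forecast error."""
    return mean([abs(f - o) for f, o in zip(forecasts, observations)])

def pearson_correlation(forecasts, observations):
    """Strength of the linear relationship between forecasts and observations."""
    fm, om = mean(forecasts), mean(observations)
    cov = sum((f - fm) * (o - om) for f, o in zip(forecasts, observations))
    var_f = sum((f - fm) ** 2 for f in forecasts)
    var_o = sum((o - om) ** 2 for o in observations)
    return cov / sqrt(var_f * var_o)

def skill_score(score, reference_score, perfect_score=0.0):
    """Generic skill: 1 is perfect, 0 matches the reference, < 0 is worse."""
    return (score - reference_score) / (perfect_score - reference_score)

# Hypothetical streamflow values, for illustration only.
observations = [10.0, 12.0, 15.0, 11.0, 20.0]
forecasts = [11.0, 12.0, 13.0, 12.0, 18.0]

# Assumed reference forecast: sample climatology (the observed mean).
climatology = [mean(observations)] * len(observations)

bias = mean_error(forecasts, observations)
mae = mean_absolute_error(forecasts, observations)
correlation = pearson_correlation(forecasts, observations)
mae_skill = skill_score(mae, mean_absolute_error(climatology, observations))

print(bias, mae, correlation, mae_skill)
```

In this sketch the skill score depends on both the chosen metric (here the mean absolute error) and the chosen reference, which is why both must be stated when reporting skill.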

Mathematically (see e.g., Bradley et al., 2003 and 2004), assessing forecast quality consists of examining the joint probability distribution function (pdf) of the forecasts, Y, and observations, X, denoted $f_{XY}(x, y)$.

The joint distribution can be factored into (Murphy and Winkler, 1987):

- $f_{X|Y}(x \mid y) \cdot f_Y(y)$, known as the “calibration-refinement” factorization,
- $f_{Y|X}(y \mid x) \cdot f_X(x)$, known as the “likelihood-base rate” factorization.
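Written out as a single identity, with forecasts $Y$ and observations $X$ as defined above, the two factorizations read as follows. The second line is an illustrative special case that assumes a binary event $X \in \{0, 1\}$ and a forecast probability $y$; it restates the flood example given earlier for reliability.

$$
f_{XY}(x, y) = f_{X|Y}(x \mid y)\, f_Y(y) = f_{Y|X}(y \mid x)\, f_X(x)
$$

$$
\Pr(X = 1 \mid Y = y) = y \quad \text{for all forecast probabilities } y
$$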

Differences between the marginal distributions $f_X(x)$ and $f_Y(y)$ describe the unconditional biases in the forecast probabilities. The comparison of the conditional pdf $f_{X|Y}(x \mid y)$ with $f_Y(y)$ describes the (conditional) reliability of the forecast probabilities.

The forecast resolution concerns only the sensitivity of the conditional pdf $f_{X|Y}(x \mid y)$ to $f_Y(y)$, without being affected by the consistency between $f_{X|Y}(x \mid y)$ and $f_Y(y)$.

Note that, for a given level of reliability, forecasts that contain less uncertainty, i.e., “sharp forecasts”, may be preferred over “unsharp” ones, since they contribute less uncertainty to decision making (Gneiting et al., 2007).

The comparison of the conditional pdf $f_{Y|X}(y \mid x)$ with $f_X(x)$ describes the ability of the forecasts to discriminate between different observed outcomes.
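For a binary event and forecast probabilities that take only a few distinct values, both factorizations can be estimated by simple counting. The sketch below is an illustration under those assumptions: the data are hypothetical, the function names are invented for this example, and the mean forecast probability is used as a one-number summary of the conditional distribution of the forecasts given the observations.

```python
from collections import defaultdict

def calibration_refinement(probabilities, outcomes):
    """Group by forecast value y.

    Returns {y: (relative frequency of y, observed event frequency given y)}.
    """
    groups = defaultdict(list)
    for y, x in zip(probabilities, outcomes):
        groups[y].append(x)
    n = len(outcomes)
    return {y: (len(xs) / n, sum(xs) / len(xs)) for y, xs in sorted(groups.items())}

def likelihood_base_rate(probabilities, outcomes):
    """Group by observed outcome x (0 = no event, 1 = event).

    Returns {x: (relative frequency of x, mean forecast probability given x)}.
    """
    groups = defaultdict(list)
    for y, x in zip(probabilities, outcomes):
        groups[x].append(y)
    n = len(outcomes)
    return {x: (len(ys) / n, sum(ys) / len(ys)) for x, ys in sorted(groups.items())}

# Hypothetical flood forecasts: probability issued and whether flooding occurred.
probabilities = [0.2] * 5 + [0.8] * 5
outcomes = [1, 0, 0, 0, 0] + [1, 1, 1, 1, 0]

cal_ref = calibration_refinement(probabilities, outcomes)
lik_base = likelihood_base_rate(probabilities, outcomes)

for y, (freq, observed) in cal_ref.items():
    print(f"forecast {y}: issued {freq:.0%} of the time, event observed {observed:.0%}")
for x, (freq, mean_forecast) in lik_base.items():
    print(f"outcome {x}: base rate {freq:.0%}, mean forecast probability {mean_forecast:.2f}")
```

With these made-up data the observed frequency matches each forecast probability (reliability), and the mean forecast probability is above the base rate for events and below it for non-events (discrimination).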

In general, several forecast attributes are important for a forecasting system to be useful for end users. However, for a particular application of the forecasts, some attributes of forecast quality may be more important than others. Data visualization of forecast and observed quantities will also help identify possible weaknesses in the forecast (as well as data issues).

**What are the most commonly used verification metrics?**

A wide range of verification metrics has emerged in the atmospheric sciences and, more recently, in other areas such as hydrology (e.g., Brown et al., 2010; Liu et al., 2011; Zappa et al., 2012). The Joint Working Group on Forecast Verification Research from the World Weather Research Programme (WWRP) and the Working Group on Numerical Experimentation maintains a reference website describing standard and newly developed verification metrics, as well as freely available verification tools and packages.

The following table includes standard verification metrics used in operational hydrometeorological forecasting. These metrics can be applied to:

- **single-valued forecasts**, which could be either deterministic forecasts or best single-valued estimates from ensemble forecasts,
- **probabilistic forecasts**, probabilities being derived from ensembles or from probability distribution functions.
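As a minimal sketch of how both forecast types can be obtained from one ensemble, the Python example below takes the ensemble mean as the single-valued estimate and the fraction of members above a threshold as the forecast probability. Both choices are assumptions made for illustration (other estimators exist), and the member values and threshold are hypothetical.

```python
def ensemble_mean(members):
    """Single-valued estimate: the average of the ensemble members."""
    return sum(members) / len(members)

def exceedance_probability(members, threshold):
    """Forecast probability: fraction of members strictly above the threshold."""
    return sum(1 for m in members if m > threshold) / len(members)

# Hypothetical 10-member streamflow ensemble and flood threshold.
members = [120.0, 135.0, 150.0, 160.0, 180.0, 210.0, 95.0, 140.0, 170.0, 155.0]
flood_threshold = 150.0

single_valued = ensemble_mean(members)
flood_probability = exceedance_probability(members, flood_threshold)

print(single_valued, flood_probability)
```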

Some of the metrics refer to forecasts of **discrete events** (e.g., defined as the variable exceeding a threshold). Both single-valued and probabilistic forecasts can define the exceedance of one or multiple discrete thresholds. For **dichotomous forecasts** that concern only one discrete event (e.g., the occurrence of a flood), one can define the contingency table, which lists the numbers of hits, misses, false alarms, and correct negatives (for a given probability level in the case of probabilistic forecasts). A number of metrics (e.g., Critical Success Index, Probability Of Detection) can be derived from the contingency table. For verifying probabilistic forecasts for one discrete event for all probabilities, the Brier Score measures the mean square error of the forecast probabilities where the observations are either 0 (no occurrence) or 1 (occurrence).
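The contingency table and the Brier Score described above can be sketched in a few lines of Python. The data, the function names and the probability level of 0.5 used to turn probabilities into yes/no forecasts are assumptions chosen for illustration.

```python
def contingency_table(probabilities, outcomes, probability_level=0.5):
    """Count hits, misses, false alarms and correct negatives.

    The event is treated as forecast when the probability is at or above
    `probability_level` (an assumed convention for this sketch).
    """
    hits = misses = false_alarms = correct_negatives = 0
    for p, o in zip(probabilities, outcomes):
        forecast_yes = p >= probability_level
        if forecast_yes and o == 1:
            hits += 1
        elif not forecast_yes and o == 1:
            misses += 1
        elif forecast_yes and o == 0:
            false_alarms += 1
        else:
            correct_negatives += 1
    return hits, misses, false_alarms, correct_negatives

def probability_of_detection(hits, misses):
    """Fraction of observed events that were forecast (needs at least one event)."""
    return hits / (hits + misses)

def critical_success_index(hits, misses, false_alarms):
    """Hits relative to all cases where the event was forecast or observed."""
    return hits / (hits + misses + false_alarms)

def brier_score(probabilities, outcomes):
    """Mean square error of the forecast probabilities against 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)

# Hypothetical flood forecast probabilities and observed occurrences (1 = flood).
probabilities = [0.9, 0.7, 0.6, 0.3, 0.2, 0.1, 0.8, 0.4]
outcomes = [1, 1, 0, 1, 0, 0, 1, 0]

hits, misses, false_alarms, correct_negatives = contingency_table(probabilities, outcomes)
pod = probability_of_detection(hits, misses)
csi = critical_success_index(hits, misses, false_alarms)
bs = brier_score(probabilities, outcomes)

print(hits, misses, false_alarms, correct_negatives, pod, csi, bs)
```

Note that the contingency-table metrics change with the chosen probability level, whereas the Brier Score uses the forecast probabilities directly.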

*Table of standard verification metrics commonly used in operational hydrometeorological forecasting (see here for further details).*

| Quality attribute | Metric name | Type of forecast | Discrete events? |
|---|---|---|---|
| Error | Mean Absolute Error | Single-valued | No |
| Error | Mean Square Error | Single-valued | No |
| Error | Root Mean Square Error | Single-valued | No |
| Error | Mean Continuous Ranked Probability Score (CRPS) | Probabilistic | No |
| Error | Brier Score | Probabilistic | Yes |
| Error | Critical Success Index (or Threat Score) | Both | Yes |
| Bias | Relative Mean Error (or Relative Bias) | Single-valued | No |
| Bias | Frequency Bias | Both | Yes |
| Correlation | Pearson Correlation Coefficient | Single-valued | No |
| Correlation | Spearman Rank Correlation | Single-valued | No |
| Skill | Mean Absolute Error Skill Score | Single-valued | No |
| Skill | Mean Square Error Skill Score | Single-valued | No |
| Skill | Mean Continuous Ranked Probability Skill Score | Probabilistic | No |
| Skill | Brier Skill Score | Probabilistic | Yes |
| Skill | Equitable Threat Score (or Gilbert Skill Score) | Both | Yes |
| Reliability (conditioned on forecast) | Mean CRPS Reliability | Probabilistic | No |
| Reliability (conditioned on forecast) | Brier Score Reliability | Probabilistic | Yes |
| Reliability (conditioned on forecast) | Reliability Diagram | Probabilistic | Yes |
| Reliability (conditioned on forecast) | Rank Histogram | Probabilistic | Yes |
| Reliability (conditioned on forecast) | Success Ratio | Both | Yes |
| Resolution (conditioned on forecast) | Mean CRPS Resolution | Probabilistic | No |
| Resolution (conditioned on forecast) | Brier Score Resolution | Probabilistic | Yes |
| Discrimination (conditioned on observation) | Relative Operating Characteristic Score | Both | Yes |
| Discrimination (conditioned on observation) | Relative Operating Characteristic Diagram | Both | Yes |
| Discrimination (conditioned on observation) | Probability Of Detection (or Hit Rate) | Both | Yes |
| Discrimination (conditioned on observation) | Probability Of False Detection (or False Alarm Rate) | Both | Yes |
| Sharpness | Forecast Frequency Histogram | Probabilistic | Yes |
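The “Brier Score Reliability” and “Brier Score Resolution” entries in the table refer to terms of the standard decomposition of the Brier Score into reliability, resolution and uncertainty. The sketch below computes those terms in Python, assuming the forecast probabilities take only a few distinct values so that the decomposition is exact when grouping by value; the data and function name are hypothetical.

```python
from collections import defaultdict

def brier_decomposition(probabilities, outcomes):
    """Return (reliability, resolution, uncertainty) of the Brier Score.

    Forecasts are grouped by their exact probability value, so that
    Brier Score = reliability - resolution + uncertainty holds exactly.
    """
    n = len(outcomes)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for p, o in zip(probabilities, outcomes):
        groups[p].append(o)
    reliability = 0.0
    resolution = 0.0
    for p, obs in groups.items():
        weight = len(obs) / n
        observed_frequency = sum(obs) / len(obs)
        reliability += weight * (p - observed_frequency) ** 2
        resolution += weight * (observed_frequency - base_rate) ** 2
    uncertainty = base_rate * (1.0 - base_rate)
    return reliability, resolution, uncertainty

# Hypothetical flood forecast probabilities and observed occurrences (1 = flood).
probabilities = [0.2] * 5 + [0.8] * 5
outcomes = [1, 1, 0, 0, 0] + [1, 1, 1, 1, 0]

reliability, resolution, uncertainty = brier_decomposition(probabilities, outcomes)
brier = sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)

# Brier Skill Score with sample climatology (the base rate) as the reference.
brier_skill = 1.0 - brier / uncertainty

print(reliability, resolution, uncertainty, brier, brier_skill)
```

A smaller reliability term and a larger resolution term both lower (improve) the Brier Score; the uncertainty term depends only on the observations.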

The complete list of references can be found here.

This post is a contribution to the new HEPEX Science and Implementation Plan.

See also in “HEPEX Science and Challenges: Verification of Ensemble Forecasts”:

- What do we mean by forecast verification? What are the goals of verification?
- What practitioners and researchers should consider when verifying forecasts?
- What are the challenges and research needs in ensemble forecast verification?