Contributed by Julie Demargne and James Brown
What should practitioners and researchers consider when verifying forecasts?
As with any statistical analysis, verification results are meaningful only when they are computed from a large sample of consistent forecasts and observations. Verification therefore requires either archiving multiple years of forecasts produced by a fixed forecast system, or hindcasting, i.e., retroactively producing forecasts from a fixed version of the forecast system.
Furthermore, verification results should be reported along with sample sizes, since sampling uncertainty can strongly affect the values of the verification statistics when samples are small (as is usually the case for extreme events). The sampling uncertainty of the verification metrics should be evaluated to answer questions such as:
- what is the uncertainty associated with the value of a verification measure?
- is forecast A significantly different (in terms of some quality metric) from forecast B, given the sampling uncertainty?
Sampling uncertainty in a specific verification metric may be estimated by computing confidence intervals via analytic, approximate, or bootstrapping methods (see, e.g., Bradley et al. 2003, Jolliffe 2007). Confidence intervals are random intervals constructed to capture the true value of the metric with a specified level of confidence (e.g., 95%, as recommended in WMO, 2008). Note that a confidence interval conveys more information about the sampling uncertainty than a simple significance test.
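As an illustration of the bootstrapping approach, the sketch below computes a percentile-bootstrap confidence interval for the Brier score of probability forecasts. The data, function names, and number of resamples are all hypothetical choices for the example, not part of any specific operational system.

```python
import numpy as np

rng = np.random.default_rng(42)

def brier_score(prob, obs):
    """Mean squared error of probability forecasts against binary outcomes."""
    return np.mean((prob - obs) ** 2)

def bootstrap_ci(prob, obs, metric, n_boot=1000, level=0.95):
    """Percentile-bootstrap confidence interval for a verification metric.

    Forecast-observation PAIRS are resampled with replacement, so the
    forecast-observation correspondence is preserved in every resample.
    """
    n = len(prob)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample pairs with replacement
        stats[i] = metric(prob[idx], obs[idx])
    lo, hi = np.percentile(stats, [(1 - level) / 2 * 100,
                                   (1 + level) / 2 * 100])
    return lo, hi

# Synthetic example: 500 probability forecasts of threshold exceedance,
# with outcomes drawn so the forecasts are roughly reliable
prob = rng.uniform(0, 1, 500)
obs = (rng.uniform(0, 1, 500) < prob).astype(float)

bs = brier_score(prob, obs)
lo, hi = bootstrap_ci(prob, obs, brier_score)
print(f"Brier score = {bs:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Reporting the interval [lo, hi] alongside the score makes clear whether an apparent difference between two forecast systems could simply reflect sampling noise.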
Forecast quality usually depends on the forecast situation or conditions. It may therefore be useful to analyze how forecast quality varies when stratifying the forecast-observation dataset. Stratification can be based on temporal or spatial conditioning (e.g., by season or region) or on atmospheric/hydrologic conditioning (e.g., occurrence of precipitation, freezing level, or flooding level). To compare results across locations, it is also useful to define common probability thresholds (e.g., from the observed probability distribution) rather than absolute or impact-based thresholds. Note that each stratification category should contain a reasonable sample size to yield reliable verification statistics.
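A minimal sketch of common probability thresholds, using made-up basin names and synthetic flow climatologies: the same climatological percentile (here the 90th) maps to a different absolute flow at each location, so event definitions remain comparable across basins with very different regimes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed daily flows (m^3/s) at two basins with different regimes
flows = {
    "basin_A": rng.gamma(shape=2.0, scale=50.0, size=3650),  # large river
    "basin_B": rng.gamma(shape=1.5, scale=5.0, size=3650),   # small basin
}

# Common probability threshold: 90th percentile of each observed climatology
thresholds = {name: np.percentile(obs, 90) for name, obs in flows.items()}

# Binary "high-flow" events defined consistently at each location
events = {name: (obs > thresholds[name]).astype(int)
          for name, obs in flows.items()}

for name in flows:
    print(f"{name}: 90th-percentile threshold = {thresholds[name]:.1f} m3/s, "
          f"event frequency = {events[name].mean():.2f}")
```

By construction, both basins have roughly a 10% event frequency, even though their absolute flow thresholds differ by an order of magnitude; an absolute threshold would have produced wildly different sample sizes per basin.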
Forecast quality may also vary strongly with:
- human and modelling factors,
- forecast lead time,
- forecast location,
- temporal scale (e.g., 6-hr flow vs. monthly volume),
- and spatial scale (e.g., flash flood basin vs. major river system).
Therefore verification metrics may be analyzed for different forecast products, for individual forecast locations, and as aggregated statistics over meaningful, homogeneous sub-groups of forecast points (e.g., points sharing similar characteristics such as the hydrologic regime), as well as for forecasts aggregated at different temporal scales. Temporal aggregation requires defining a new variable, such as a minimum, maximum, average, or total, with a different time step from that of the original forecasts.
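The temporal-aggregation step above can be sketched as follows, assuming pandas is available; the dates, time step, and flow values are synthetic placeholders for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical 6-hourly flow forecasts (m^3/s) for one location over one year
idx = pd.date_range("2020-01-01", periods=4 * 366, freq="6h")
flow = pd.Series(rng.gamma(2.0, 20.0, size=len(idx)), index=idx)

# Aggregation defines new variables at coarser time steps:
daily_max = flow.resample("D").max()    # daily peak flow (m^3/s)
daily_mean = flow.resample("D").mean()  # daily mean flow (m^3/s)

# Monthly volume (m^3): sum of 6-hr mean flows times seconds per 6-hr step
monthly_volume = flow.resample("MS").sum() * 6 * 3600

print(daily_max.head(3))
print(monthly_volume.head(3))
```

Each aggregated series (daily peak, monthly volume) is then verified against observations aggregated in exactly the same way, so that forecast and observed variables remain directly comparable.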
Despite the common underlying mathematics, verification of hydrological and meteorological forecasts involves significant practical differences (e.g., Pappenberger et al. 2008) since:
- flow or stage forecasts are generally verified for irregular hydrological units (e.g., basins);
- flow or stage forecasts depend on the quality of various hydrometeorological inputs over the hydrological unit; these inputs therefore need to be verified, at the relevant hydrologic scale, against the observations used in the hydrologic models or application;
- hydrologic forecasting requires integrating different hydrometeorological inputs that are consistent across a range of space and time scales, from local to national scale, and from short to long term;
- and uncertainty in flow or stage forecasts comes from various sources; sensitivity analysis of the different sources of uncertainty relies on using different forecasting scenarios.
The complete list of references can be found here.
This post is a contribution to the new HEPEX Science and Implementation Plan.
See also in “HEPEX Science and Challenges: Verification of Ensemble Forecasts”: