Blog post contributed by: Tim DelSole*
The Sign Test
Is one forecast model better than another? A natural approach to answering this question is to run a set of forecasts with each model and then see which set has more skill. This comparison requires a statistical test to ensure that the estimated difference represents a real difference in skill, rather than a random sampling error. Unfortunately, there are three problems with using standard difference tests: they have low statistical power, they assume uncorrelated forecast errors, and their significance levels are wrong.
As a simple example, consider a (typical) seasonal prediction system with a 20-year re-forecast data set and a correlation skill of 0.5. To detect an improvement in skill using standard tests, the new seasonal prediction system would need a correlation skill exceeding 0.84, or a mean squared error that is lower than the original by a factor of about 2.5! Such gains are inconceivable in seasonal prediction. Even worse, forecast errors tend to be correlated in typical comparison studies, which violates a fundamental assumption of the difference-in-skill test (see DelSole and Tippett, 2014, 2016).
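The post does not say which standard test produced the 0.84 figure. One calculation that gives a very similar threshold is the two-sided Fisher z-transform test for the difference between two independent correlations, sketched below. The function name and the choice of a 5% two-sided test are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def min_detectable_correlation(r_old, n_years, alpha=0.05):
    """Smallest new correlation that a two-sided Fisher z difference test flags.

    Assumes both correlations come from independent samples of n_years each,
    which is exactly the assumption the post argues is violated in practice.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of the difference between two Fisher-transformed correlations.
    se_diff = math.sqrt(2.0 / (n_years - 3))
    return math.tanh(math.atanh(r_old) + z_crit * se_diff)


threshold = min_detectable_correlation(0.5, 20)
print(round(threshold, 2))  # 0.84
```

With only 20 years of data the standard error of the difference is so large that the new model must almost double its correlation before the test notices.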
Interestingly, economists have studied this problem quite extensively and developed several methods for comparing forecasts. A good starting point is the seminal paper Diebold and Mariano (1995). A key concept in effective skill comparison is to examine paired differences. That is, instead of computing the skill of each forecast separately and then taking their difference, you compute the difference in forecast performance first, for each event, and then aggregate those differences.
For example, suppose you have a criterion for deciding which model makes a “better” forecast of an event. Then, if two models are equally skillful, the probability that one model beats the other is 50%. At this point, the problem looks exactly like the problem of deciding if a coin is fair. As everyone knows, you test if a coin is fair by flipping it several times and checking that about half of the tosses are heads. More precisely, the number of heads follows a binomial distribution with p = 1/2, and the null hypothesis is rejected if the number of heads falls outside the 95% interval computed from that distribution. Similarly, to compare the skill of models A and B, you simply count the number of times model A produces a better forecast than B, and compare that count to the 95% interval.
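The coin-flip analogy can be sketched in a few lines. This is a minimal illustration, not the code distributed with the papers: the function name is hypothetical, ties are assumed to have been dropped before counting, and the p-value is the usual two-sided exact binomial value under p = 1/2.

```python
from math import comb


def sign_test_p_value(wins_a, n_events):
    """Two-sided exact binomial p-value for the null hypothesis p = 1/2.

    wins_a   : number of events on which model A gave the better forecast
    n_events : number of events with a clear winner (ties excluded)
    """
    # Under p = 1/2 the distribution is symmetric, so double the larger tail.
    k = max(wins_a, n_events - wins_a)
    upper_tail = sum(comb(n_events, j) for j in range(k, n_events + 1)) / 2 ** n_events
    return min(1.0, 2.0 * upper_tail)


print(round(sign_test_p_value(15, 20), 3))  # 0.041
print(round(sign_test_p_value(14, 20), 3))  # 0.115
```

With 20 events, model A has to win at least 15 times before the hypothesis of equal skill is rejected at the 5% level.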
The above test is called the sign test because it can be evaluated simply by computing differences in some measure and then counting the number of positive signs. It has been applied in weather prediction studies (Hamill, 1999), but the simplicity and versatility of the test do not seem to be widely recognized. For instance, notice that no restrictions were imposed on the criterion for deciding the better forecast. If a forecaster is predicting a single index, then a natural criterion is to select the forecast closest to the observed value. But the test is much more versatile than this. For instance, if the forecaster is making probabilistic forecasts of binary events, then the criterion could be the forecast having the smaller Brier score. For multiple-category events, the criterion could be the forecast having the smaller ranked probability score. If the forecaster is predicting spatial fields, then the criterion could be the field with the larger pattern correlation. For probabilistic spatial forecasts, the criterion could be the field with the larger spatially-averaged Heidke skill score.
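To illustrate how freely the criterion can be chosen, the sketch below counts wins from any per-event score and applies it to a per-event Brier score. The helper names and the four forecasts are invented for illustration only, and discarding ties is an assumption.

```python
def count_wins(scores_a, scores_b, lower_is_better=True):
    """Return (wins for A, number of non-tied events) from per-event scores."""
    wins_a = 0
    n_events = 0
    for a, b in zip(scores_a, scores_b):
        if a == b:
            continue  # ties carry no information about which model is better
        n_events += 1
        if (a < b) == lower_is_better:
            wins_a += 1
    return wins_a, n_events


def brier(prob, outcome):
    """Per-event Brier score for a binary event (outcome is 0 or 1)."""
    return (prob - outcome) ** 2


# Hypothetical probabilistic forecasts of a binary event.
outcomes = [1, 0, 0, 1]
probs_a = [0.9, 0.2, 0.7, 0.4]
probs_b = [0.6, 0.3, 0.5, 0.4]

scores_a = [brier(p, o) for p, o in zip(probs_a, outcomes)]
scores_b = [brier(p, o) for p, o in zip(probs_b, outcomes)]
result = count_wins(scores_a, scores_b)
print(result)  # (2, 3)
```

For a score where larger is better, such as pattern correlation, the same function can be called with lower_is_better=False.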
Interestingly, the problem of field significance (Livezey and Chen, 1983) is circumvented by this approach. Also, the test makes no assumptions about the error distributions, in contrast to tests based on correlation and mean square error (which assume Gaussian distributions). Finally, competing forecasts can be correlated with each other. This last point is particularly important for model development, as a change in a parameterization may change the forecast only modestly relative to the original model. Standard difference tests based on mean square error or correlation are completely useless in such cases. These concepts, and additional skill comparison tests, are discussed in more detail in DelSole and Tippett (2014).
The Random Walk Test
An informative way to summarize the results of the sign test is to display a sequence of sign tests in the form of a random walk. Figure 1 shows a schematic of the random walk test. A random walk is the path traced by a particle as it moves forward in time. Between events, the particle simply moves parallel to the x-axis. At each event, the particle moves up one unit if forecast A is more skillful than B; otherwise it moves down one unit. If the forecasts are equally skillful, then the upward and downward steps are equally probable and the average location is y = 0. Happily, the 95% interval for a random walk has a very simple form: it is approximately ±2√N, where N is the number of independent forecast events. The hypothesis that the forecasts are equally skillful is rejected when the particle walks outside the 95% interval.
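The construction above can be sketched as follows. The function names are hypothetical, and the sketch assumes that the ±2√N bound is evaluated at every step using the number of events seen so far.

```python
import math


def random_walk(a_is_better):
    """Build the walk and its approximate 95% bounds from a win/loss sequence.

    a_is_better : iterable of booleans, True when forecast A beat forecast B
    Returns (path, bounds); the interval at step n is [-bounds[n-1], +bounds[n-1]].
    """
    path, bounds = [], []
    position = 0
    for n, a_wins in enumerate(a_is_better, start=1):
        position += 1 if a_wins else -1
        path.append(position)
        bounds.append(2.0 * math.sqrt(n))
    return path, bounds


def first_exit(path, bounds):
    """Return the first step (1-based) at which the walk leaves the interval."""
    for n, (y, b) in enumerate(zip(path, bounds), start=1):
        if abs(y) > b:
            return n
    return None


path, bounds = random_walk([True] * 10)
print(first_exit(path, bounds))  # 5
```

A model that wins every event leaves the interval at the fifth event, since 5 is the first N larger than 2√N. Plotting path together with the two bound curves against time gives the kind of figure described in the text.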
The merit in displaying the sign test as a random walk lies in the fact that the figure may reveal time-dependent variations in skill. A spectacular example of this is shown in Figure 2. This figure shows results of the random walk test comparing monthly mean forecasts of the NINO3.4 index at 2.5-month lead between CFSv2 and other models in the North American Multi-Model Ensemble. From 1982 to about 2000, the random walks drifted upward, with some models going above the 95% interval, indicating that CFSv2 was more skillful than these models. However, after 2000, there is an abrupt change in skill, so by the end of the period most models lie below the 95% interval, indicating that these models were more skillful than CFSv2.
The poor skill of CFSv2 relative to other models is widely attributed to a discontinuity in climatology due to the introduction of ATOVS satellite data into the assimilation system in October 1998, as discussed in Kumar et al. (2012), Barnston and Tippett (2013), and Saha et al. (2014). This example illustrates how the random walk test could be used routinely at operational centers for monitoring possible changes in skill due to changes in the dynamical model or data assimilation system.
Several points about the random walk test are worth emphasizing. Technically, the location of the path on the very last time step dictates the final decision, because it is based on the most data. Also, the results of the sign test can be inverted to give an estimate of the probability p that one model is more skillful per event than another model. That is, the random walk test can be used to quantify the difference in skill in terms of a probability per event.
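The post does not specify how the inversion is carried out. As one rough possibility, the sketch below uses the Wilson score interval, which inverts the normal approximation to the binomial test; the function name is hypothetical and this is not necessarily the method used in the cited papers.

```python
import math
from statistics import NormalDist


def win_probability_interval(wins_a, n_events, level=0.95):
    """Estimate p = P(A beats B on a given event) with a Wilson score interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p_hat = wins_a / n_events
    denom = 1 + z * z / n_events
    center = (p_hat + z * z / (2 * n_events)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / n_events + z * z / (4 * n_events ** 2)
    ) / denom
    return p_hat, center - half_width, center + half_width


p_hat, lower, upper = win_probability_interval(15, 20)
print(p_hat, round(lower, 2), round(upper, 2))
```

If the interval excludes 0.5, the data are inconsistent with equal skill, and the interval itself indicates how often one model can be expected to beat the other.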
Also, the random walk test ignores the amplitude of the errors. This is both an advantage and a disadvantage. The advantage is that it allows a test to be formulated in a way that avoids distributional assumptions about forecast errors. The disadvantage is that it does not incorporate amplitude information, so a forecast might have huge busts but still be considered more skillful than another model that never busts. An alternative is the Wilcoxon signed-rank test, a non-parametric test that does incorporate amplitude information.
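The effect of amplitude can be seen in a small hand-rolled version of the Wilcoxon signed-rank test. This is only a sketch: the errors are invented, the function name is hypothetical, and the p-value uses the large-sample normal approximation without a tie correction, which is crude for a sample this small.

```python
import math
from statistics import NormalDist


def signed_rank_test(err_a, err_b):
    """Wilcoxon signed-rank test on differences in absolute error.

    A positive difference means model A was closer to the observation.
    Returns (W+, z, two-sided p) using the normal approximation.
    """
    diffs = [abs(b) - abs(a) for a, b in zip(err_a, err_b)]
    diffs = [d for d in diffs if d != 0]  # drop ties between the models
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Give equal magnitudes the average of the ranks they span.
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return w_plus, z, p_value


# Hypothetical errors: A is slightly better on 8 events but busts badly on 2.
err_a = [1, 2, 1, 3, 2, 1, 2, 3, 50, 60]
err_b = [2, 4, 4, 7, 7, 7, 9, 11, 10, 10]
w_plus, z, p_value = signed_rank_test(err_a, err_b)
print(w_plus, round(p_value, 2))
```

Model A wins 8 of the 10 events, but its two busts receive the two largest ranks, so W+ is only 36 out of a possible 55 and the evidence in favor of A is much weaker than the win count alone suggests.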
Importantly, the random walk test assumes each event is independent of the others. This assumption does not hold if the initial conditions of successive forecasts are sufficiently close. It is also not clear how to account for serial correlation in this test, or for forecast calibration when the calibration is based on the same data as the forecast comparison data set.
This test and additional skill comparison tests are discussed in more detail in DelSole and Tippett (2014, 2016). Also, R and Matlab codes for performing these tests can be found at http://cola.gmu.edu/delsole/webpage/SkillComparison/skillcomparison.html.
*Tim DelSole is a professor in the Department of Atmospheric, Oceanic, and Earth Sciences, George Mason University, and a Senior Researcher with the Center for Ocean-Land-Atmosphere Studies.
Barnston, A. G. and M. K. Tippett, 2013: Predictions of Nino3.4 SST in CFSv1 and CFSv2: A Diagnostic Comparison. Clim. Dyn., 41, 1–19, doi:10.1007/s00382-013-1845-2.
DelSole, T. and M. K. Tippett, 2014: Comparing forecast skill. Mon. Wea. Rev., 142, 4658–4678.
DelSole, T. and M. K. Tippett, 2016: Forecast comparison based on random walks. Mon. Wea. Rev., 144 (2), 615–626.
Diebold, F. X. and R. S. Mariano, 1995: Comparing predictive accuracy. J. Bus. Econ. Stat., 13, 253–263.
Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. Wea. Forecasting, 14, 155–167.
Kumar, A., M. Chen, L. Zhang, W. Wang, Y. Xue, C. Wen, L. Marx, and B. Huang, 2012: An analysis of the nonstationarity in the bias of sea surface temperature forecasts for the NCEP Climate Forecast System (CFS) version 2. Mon. Wea. Rev., 140, 3003–3016.
Livezey, R. E. and W. Chen, 1983: Statistical field significance and its determination by Monte Carlo techniques. Mon. Wea. Rev., 111, 46–59.
Saha, S., et al., 2014: The NCEP Climate Forecast System Version 2. J. Climate, 27, 2185–2208.