NFL Elo Models Compared

Yesterday, I published an update to my NFL Elo rating model, and today I published the "after week 6" ratings and rankings.

In this post, I'd like to compare the accuracy stats for three different models.

Edit 2023-11-12: this post has been updated with the "margin of victory differential" table converted back to football points, which reveals that the "Bigger K" model is in fact better both at picking winners and at estimating margins of victory.

The Models

This post compares three models:

- "Original": the model as previously published, which updates ratings conservatively.
- "Bigger K": the updated model, which uses a larger K-factor and therefore reacts more strongly to individual game results.
- "Blank Slate": a model that starts every team from the same rating, carrying over no prior seasons' history.
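As background, a standard Elo update moves a team's rating toward its actual result, and the K-factor controls how far it moves. Here's a minimal sketch; the ratings and K values below are illustrative only, not the models' actual parameters:

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score for team A, on the 0.0-1.0 scale."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_rating(rating, expected, actual, k):
    """Move a rating toward the actual result; a bigger K moves it further."""
    return rating + k * (actual - expected)

# Example: a 1500-rated team upsets a 1600-rated team.
exp = expected_score(1500, 1600)          # ~0.36
print(update_rating(1500, exp, 1.0, 20))  # smaller K: ~1512.8
print(update_rating(1500, exp, 1.0, 40))  # bigger K:  ~1525.6
```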

The Error Rates

Number of Game Winners Picked

The table below shows the number of game winners correctly picked by each model, for games since week 1 of the 2013 season. To compare early-season vs. late-season accuracy, the data is broken down by week: the first row counts winners picked across all weeks (on and after week 1), while the 8th row counts only games on and after week 8, about midway through the season.
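For illustration, here's a minimal sketch of how these cumulative counts could be produced; the games list and its field names are hypothetical stand-ins for the real data:

```python
# One record per tested game since week 1 of the 2013 season
# (hypothetical structure; `picked_winner` means the model's
# pre-game favorite actually won).
games = [
    {"week": 1, "picked_winner": True},
    {"week": 8, "picked_winner": False},
    # ...
]

for start_week in range(1, 17):
    subset = [g for g in games if g["week"] >= start_week]
    correct = sum(g["picked_winner"] for g in subset)
    print(f"on+after wk. {start_week}: {correct} of {len(subset)} picked")
```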

Number of Game Winners Picked by Model
Games Count   Game Weeks        Bigger K   Original   Blank Slate
2800          on+after wk. 1    1832       1832       1724
2625          on+after wk. 2    1722       1718       1638
2449          on+after wk. 3    1609       1606       1534
2273          on+after wk. 4    1500       1504       1438
2106          on+after wk. 5    1397       1396       1346
1945          on+after wk. 6    1290       1288       1250
1786          on+after wk. 7    1188       1186       1158
1643          on+after wk. 8    1091       1090       1064
1502          on+after wk. 9    988        983        974
1369          on+after wk. 10   909        904        896
1231          on+after wk. 11   830        827        817
1090          on+after wk. 12   728        728        719
937           on+after wk. 13   622        626        609
782           on+after wk. 14   520        524        513
627           on+after wk. 15   420        421        415
467           on+after wk. 16   311        313        308

We can see that, over the full data set, the "Bigger K" model picks exactly as many game winners as the "Original" model (1832 each). Week to week, "Bigger K" generally does a better job toward the beginning and middle of the season, while the "Original" model does a better job of picking winners in the last few weeks of the season.

The "Blank Slate" model picks far fewer game winners correctly in the beginning of the season, but by the end of the season, as expected, performs similarly to the other models (though still a bit worse).

Margin of Victory Differential

The "Margin of Victory Differential" table below shows how accurate each model's expected margin of victory is. It shows the median value over all tested games, again since week 1 of the 2013 season.

The "differential" is between the expected margin of victory and the actual margin of victory (in football points) for each football game. The model uses each team's rating, home-vs-away, etc., to calculate an expected "Elo score" for both teams. The "Elo score" is on a 0.0 to 1.0 scale. A team that gets "blown out" earns a 0.0, teams that tie earn a 0.5, and a team that wins in a blowout earns a score of 1.0. The sum of both teams' scores is 1.0. Using each model's parameters for "close victory" and "blowout," these 0.0-1.0 expected scores are converted to football points.

For example, if "team A" is expected to win by 7 points and ends up winning by 10 points, the "margin of victory differential" is 3 points; if "team A" instead wins by only 3 points, the differential is 4 points. The differential also accounts for picking the wrong winner: if "team A" is expected to win by 3 points and ends up losing by 3 points, the differential is 6 points.
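In code, the differential is just the absolute difference between the two signed margins, both taken from the same team's perspective (negative meaning a loss):

```python
def mov_differential(expected_margin, actual_margin):
    """Absolute gap between expected and actual signed margins."""
    return abs(expected_margin - actual_margin)

print(mov_differential(7, 10))   # expected to win by 7, won by 10 -> 3
print(mov_differential(7, 3))    # expected to win by 7, won by 3  -> 4
print(mov_differential(3, -3))   # expected to win by 3, lost by 3 -> 6
```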

A median "margin of victory differential" of 8.0 means that half of the tested games had an actual margin of victory within 8 points of the estimated margin, and half were off by more than 8 points.
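For example, with Python's statistics.median over a hypothetical handful of per-game differentials:

```python
from statistics import median

# Hypothetical per-game differentials; the real tables use every
# tested game since week 1 of the 2013 season.
differentials = [2.5, 3.0, 4.0, 6.0, 8.0, 11.5, 14.0]
print(median(differentials))  # 6.0: half the games are off by less, half by more
```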

Median Margin of Victory Differential by Model
Games Count   Game Weeks        "Bigger K"   "Original"   "Blank Slate"
2800          on+after wk. 1    8.0          8.0261       8.0636
2625          on+after wk. 2    8.0417       8.0915       8.1326
2449          on+after wk. 3    8.1496       8.1194       8.2433
2273          on+after wk. 4    8.078        8.0382       8.1103
2106          on+after wk. 5    8.0          8.0029       8.049
1945          on+after wk. 6    8.0          8.0114       8.1113
1786          on+after wk. 7    8.0          8.016        8.1296
1643          on+after wk. 8    8.0          8.0          8.0604
1502          on+after wk. 9    8.0          8.0081       8.0829
1369          on+after wk. 10   8.0          8.0          8.1103
1231          on+after wk. 11   7.9073       8.0          8.0604
1090          on+after wk. 12   8.0106       8.0757       8.3364
937           on+after wk. 13   8.2963       8.1807       8.3774
782           on+after wk. 14   8.3071       8.2555       8.3521
627           on+after wk. 15   8.1665       8.1368       8.3381
467           on+after wk. 16   8.4544       8.4859       8.654

The new "Bigger K" model is a little better at predicting margins of victory/defeat than the "Original" model, and it is the best of the three at most points in the season. The "Blank Slate" model surprisingly gets worse toward the end of the season, and overall it performs the worst.

I'm not sure how meaningful this margin of victory differential is. But correctly predicting both blowouts and close games seems like a sign of an accurate model, which is why I've included it in this post and why I use it when tuning a model for accuracy.

Conclusions

The "Blank Slate" model is fun to look at, but is clearly less accurate. It picks 108 fewer game winners than the other models since 2013.

The "Original" model is more conservative. It effectively gives historically "good" teams the benefit of the doubt, and doesn't drop their Elo rating too much after a few bad games. As a result, it tends to be a hair more accurate in the later stages of the season than the "Bigger K" model.

The "Bigger K" model however reacts more strongly to individual games, and passes the "eye test" much more convincingly than the "Original" model, especially in the early stages of a season. Accordingly, it does pick more winners in the beginning and middle of the season than the "Original" model. It also is the best model at estimating margin of victory.

Again, here is the power rankings page for the 2023 season. Oh, and here are the 2023-only rankings that use the "Blank Slate" model.