Yesterday, I published an update to my NFL Elo rating model, and today I published the "after week 6" ratings and rankings.
In this post, I'd like to compare the accuracy stats for three different models.
Edit 2023-11-12: this post has been updated with the "margin of victory differential" table converted back to football points, which reveals that the "Bigger K" model is in fact better both at picking winners and at estimating margins of victory.
For games since week 1 of the 2013 season, the table below shows the number of game winners correctly picked by each model. To compare early-season and late-season performance, the data is broken down by week. The first row, for example, counts game winners correctly picked in all weeks (on and after week 1). The 8th row counts only games on and after week 8, about midway through the season.
Number of Game Winners Picked by Model

Games Count | Game Weeks | Bigger K | Original | Blank Slate |
---|---|---|---|---|
2800 | on+after wk. 1 | 1832 | 1832 | 1724 |
2625 | on+after wk. 2 | 1722 | 1718 | 1638 |
2449 | on+after wk. 3 | 1609 | 1606 | 1534 |
2273 | on+after wk. 4 | 1500 | 1504 | 1438 |
2106 | on+after wk. 5 | 1397 | 1396 | 1346 |
1945 | on+after wk. 6 | 1290 | 1288 | 1250 |
1786 | on+after wk. 7 | 1188 | 1186 | 1158 |
1643 | on+after wk. 8 | 1091 | 1090 | 1064 |
1502 | on+after wk. 9 | 988 | 983 | 974 |
1369 | on+after wk. 10 | 909 | 904 | 896 |
1231 | on+after wk. 11 | 830 | 827 | 817 |
1090 | on+after wk. 12 | 728 | 728 | 719 |
937 | on+after wk. 13 | 622 | 626 | 609 |
782 | on+after wk. 14 | 520 | 524 | 513 |
627 | on+after wk. 15 | 420 | 421 | 415 |
467 | on+after wk. 16 | 311 | 313 | 308 |
Over all games since 2013, the "Bigger K" model picks the same number of game winners as the "Original" model (1832 of 2800). Because each row counts games on and after a given week, the later rows isolate late-season performance: "Bigger K" is ahead or tied in every row through week 12 except week 4, while the "Original" model picks more winners in the rows for week 13 onward.
The "Blank Slate" model picks far fewer game winners correctly in the beginning of the season, but by the end of the season, as expected, performs similarly to the other models (though still a bit worse).
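To put the raw counts on a common scale, they can be converted to hit rates. The snippet below is a minimal sketch that uses only the first and last rows of the table above; the names `ROWS` and `hit_rate` are illustrative and not part of the model's code.

```python
# Rows copied from the "Number of Game Winners Picked by Model" table:
# (label, games, bigger_k, original, blank_slate)
ROWS = [
    ("on+after wk. 1", 2800, 1832, 1832, 1724),
    ("on+after wk. 16", 467, 311, 313, 308),
]


def hit_rate(correct: int, games: int) -> float:
    """Fraction of games whose winner was picked correctly."""
    return correct / games


for label, games, bigger_k, original, blank_slate in ROWS:
    rates = [hit_rate(c, games) for c in (bigger_k, original, blank_slate)]
    print(label, " ".join(f"{r:.1%}" for r in rates))
```

Over all games since 2013, this works out to roughly 65.4% for both "Bigger K" and "Original", and 61.6% for "Blank Slate".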
The "Margin of Victory Differential" table below shows how accurate each model's expected margin of victory is. It shows the median value over all tested games, again since week 1 of the 2013 season.
The "differential" is between the expected margin of victory and the actual margin of victory (in football points) for each football game. The model uses each team's rating, home-vs-away, etc., to calculate an expected "Elo score" for both teams. The "Elo score" is on a 0.0 to 1.0 scale. A team that gets "blown out" earns a 0.0, teams that tie earn a 0.5, and a team that wins in a blowout earns a score of 1.0. The sum of both teams' scores is 1.0. Using each model's parameters for "close victory" and "blowout," these 0.0-1.0 expected scores are converted to football points.
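To illustrate the 0.0 to 1.0 expected score, here is a sketch of the textbook Elo expected-score curve. It assumes the standard logistic formula with a 400-point scale and a flat home-field bonus; the ratings and the bonus are made up, and the model discussed here may use different values or a different formula.

```python
def expected_score(rating_diff: float, scale: float = 400.0) -> float:
    """Textbook Elo expected score (0.0 to 1.0) for a team with the given rating edge.

    Assumption: the logistic curve and 400-point scale are the standard Elo
    choices, not necessarily what this model uses.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / scale))


# Hypothetical inputs: ratings and home-field bonus are made up for illustration.
home_rating, away_rating, home_bonus = 1600.0, 1500.0, 50.0
edge = home_rating + home_bonus - away_rating

home_expected = expected_score(edge)
away_expected = expected_score(-edge)

# The two expected scores are complementary, so they sum to 1.0.
print(round(home_expected, 3), round(away_expected, 3))
```

The conversion from expected score to football points is left out of the sketch, because it depends on each model's "close victory" and "blowout" parameters, which are not listed in this post.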
For example, if "team A" is expected to win by 7 points, and ends up winning the game by 10 points, the "margin of victory differential" is 3 points. If "team A" ends up winning by 3 points, the differential is 4 points. The differential also includes the expected winner vs actual winner: if "team A" is expected to win by 3 points, and ends up actually losing by 3 points, the "margin of victory differential" is 6 points.
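The three cases above reduce to one absolute difference if margins are signed from team A's point of view (positive for a win, negative for a loss). A small sketch, with the hypothetical function name `mov_differential`:

```python
def mov_differential(expected_margin: float, actual_margin: float) -> float:
    """Absolute gap between expected and actual margin, both from team A's side.

    Positive margins mean team A wins; negative margins mean team A loses.
    """
    return abs(expected_margin - actual_margin)


print(mov_differential(7, 10))   # expected to win by 7, won by 10 -> 3
print(mov_differential(7, 3))    # expected to win by 7, won by 3  -> 4
print(mov_differential(3, -3))   # expected to win by 3, lost by 3 -> 6
```

Using signed margins means the "wrong winner" case needs no special handling.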
A median "margin of victory differential" of 8.0 means that the estimated margin was within 8 points of the actual margin in half of the games, and missed by more than 8 points in the other half.
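As a quick illustration of how the median splits the games, here is a sketch using Python's `statistics.median`. The seven differentials are invented for the example and are not output from any of the models.

```python
from statistics import median

# Hypothetical differentials (in points) for seven games; not real model output.
differentials = [1.0, 3.0, 6.0, 8.0, 10.0, 14.0, 21.0]

mid = median(differentials)
closer = sum(d < mid for d in differentials)   # estimates that missed by less
farther = sum(d > mid for d in differentials)  # estimates that missed by more
print(mid, closer, farther)  # 8.0 3 3
```

One large miss (the 21-point game) does not move the median, which is one reason to prefer it over the mean for a statistic like this.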
Median Margin of Victory Differential by Model (football points)

Games Count | Game Weeks | Bigger K | Original | Blank Slate |
---|---|---|---|---|
2800 | on+after wk. 1 | 8.0 | 8.0261 | 8.0636 |
2625 | on+after wk. 2 | 8.0417 | 8.0915 | 8.1326 |
2449 | on+after wk. 3 | 8.1496 | 8.1194 | 8.2433 |
2273 | on+after wk. 4 | 8.078 | 8.0382 | 8.1103 |
2106 | on+after wk. 5 | 8.0 | 8.0029 | 8.049 |
1945 | on+after wk. 6 | 8.0 | 8.0114 | 8.1113 |
1786 | on+after wk. 7 | 8.0 | 8.016 | 8.1296 |
1643 | on+after wk. 8 | 8.0 | 8.0 | 8.0604 |
1502 | on+after wk. 9 | 8.0 | 8.0081 | 8.0829 |
1369 | on+after wk. 10 | 8.0 | 8.0 | 8.1103 |
1231 | on+after wk. 11 | 7.9073 | 8.0 | 8.0604 |
1090 | on+after wk. 12 | 8.0106 | 8.0757 | 8.3364 |
937 | on+after wk. 13 | 8.2963 | 8.1807 | 8.3774 |
782 | on+after wk. 14 | 8.3071 | 8.2555 | 8.3521 |
627 | on+after wk. 15 | 8.1665 | 8.1368 | 8.3381 |
467 | on+after wk. 16 | 8.4544 | 8.4859 | 8.654 |
The new "Bigger K" model is a little better at predicting margins of victory/defeat than the "Original" model: its median differential is the lowest, or tied for the lowest, in 11 of the 16 rows. The "Blank Slate" model, surprisingly, gets worse toward the end of the season, and it has the highest median differential in every row.
I'm not sure how meaningful this margin of victory differential is. It seems to me that properly predicting both blowouts and close games is a sign of an accurate model, so that's why I've included it in this post and why I use it when tuning a model for accuracy.
The "Blank Slate" model is fun to look at, but is clearly less accurate. It picks 108 fewer game winners than the other models since 2013.
The "Original" model is more conservative. It effectively gives historically "good" teams the benefit of the doubt, and doesn't drop their Elo rating too much after a few bad games. As a result, it tends to be a hair more accurate in the later stages of the season than the "Bigger K" model.
The "Bigger K" model, however, reacts more strongly to individual games, and passes the "eye test" much more convincingly than the "Original" model, especially in the early stages of a season. Accordingly, it picks more winners in the beginning and middle of the season than the "Original" model. It is also the best model at estimating margin of victory.
Again, here is the power rankings page for the 2023 season. Oh, and here are the 2023-only rankings that use the "Blank Slate" model.