Thursday, July 5, 2007

Refining the Model Episode II - Attack of the Current Inputs

To bring things up to speed: the following table lists the current subset of possible inputs that have the best correlation with the final score margin and provide the best predictions. If you are a careful enough reader, you'll notice some new inputs, which I'll explain in a second. All of the inputs are expressed in terms of value over league average, which is a percentage.

Abbreviation key:
H = Home, A = Away, O = Offense, D = Defense, M = Made, A = Allowed, G = Given, T = Taken
R = Rush, P = Pass, SR = Sack Rate, 3C = Third down conversion rate, IR = Interception rate, FR = Fumble rate
v = versus (Home VOLA - Away VOLA)
U = Unadjusted, A = Adjusted for opponent quality (relative to league average)
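As a rough sketch of how the VOLA and "v" inputs in the key could be computed: the formula below (percentage difference from the league average) is my assumption about what "value over league average" means, and the function names and numbers are made up for illustration.

```python
def vola(team_value, league_average):
    """Value over league average, as a percentage of the league average.

    Assumed formula; the post only says VOLA is expressed as a percentage.
    """
    return 100.0 * (team_value - league_average) / league_average


def versus(home_vola, away_vola):
    """The 'v' inputs from the key: Home VOLA minus Away VOLA."""
    return home_vola - away_vola


# Made-up yards-per-carry figures, purely for illustration.
league_ypc = 4.0
home_ro = vola(4.5, league_ypc)  # hypothetical home rush offense
away_ro = vola(3.5, league_ypc)  # hypothetical away rush offense
print(home_ro, away_ro, versus(home_ro, away_ro))
```

A positive "v" value favors the home team on that input, a negative one the away team.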

Unadj/Adj | Input | Correlation with Margin

As you can see, the passing stats are now more highly correlated with the margin than the running stats. NFL Stats pointed out that yards per attempt statistics are more significant than yards per game stats. Also, a yards-per-pass attempt stat should include the yards lost in sacks. I'll get to some of the correlations of my inputs with win totals at the end of the post, as they back up those assertions.

Without getting into gross detail, using VOLA instead of raw stats bumped up the correlation coefficients slightly but consistently, and adjusting for opponent quality significantly increased the correlation coefficients for rushing and passing inputs. It's interesting to note that the home team's quality is more important (in terms of the correlation) than that of the away team. In the turnover stats, the home team's ability to pick off passes is more important than its own QB's ability to not throw interceptions. Similarly, the home team's passing and rushing games are more important than those of the away team. This could be another effect of teams simply performing better at home and another justification for adjusting stats to account for home field advantage. That article will come some time next week.

Since May, I've made the following changes to the data set:

  • No more punt return data. The correlation was too low.
  • Tried kickoff return data. Same problem.
  • Penalty first downs given to opponent and penalty yards lost. Per game. Same problem.
  • Third down conversion data. Being able to sustain drives by converting third downs is, of course, important. And third-down attempts are frequent enough to justify adding an input, unlike fourth-down attempts.
  • Rushing stats are based on yards per carry. Adjust for opponent in similar fashion to sack rates (adjusting rate/average based on league average rather than totals).
  • Passing stats are now yards per attempt, and yards lost on sacks are no longer added back to the totals, so sack yardage counts against the passing game.
  • Instead of a broad turnover ratio, I'm using interception rates and fumble rates. As it turns out, a stat of the combined VOLAs of the inputs listed in the table above has a .13 correlation coefficient, about as high as the turnover ratio. Like sack rates, it makes more sense to judge a team by how often they throw picks rather than how many they throw.
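The rate stats in the list above can be sketched roughly as follows. This is only an illustration: the function names are hypothetical, the numbers are made up, and whether sacks count as pass attempts in the denominator is not stated in the post, so it is left as an option.

```python
def yards_per_carry(rush_yards, carries):
    """Rushing efficiency as a rate rather than a per-game total."""
    return rush_yards / carries


def net_yards_per_attempt(gross_pass_yards, sack_yards_lost, attempts,
                          sacks=0, count_sacks_as_attempts=False):
    """Passing yards per attempt with sack yardage left in (subtracted).

    count_sacks_as_attempts is an assumption-driven switch: the post
    does not say which denominator it uses.
    """
    plays = attempts + sacks if count_sacks_as_attempts else attempts
    return (gross_pass_yards - sack_yards_lost) / plays


def interception_rate(interceptions, attempts):
    """How often a team throws a pick, not how many it throws."""
    return interceptions / attempts


# Made-up season totals, purely for illustration.
print(yards_per_carry(2000, 500))
print(net_yards_per_attempt(3500, 200, 500))
print(net_yards_per_attempt(3500, 200, 500, sacks=50,
                            count_sacks_as_attempts=True))
print(interception_rate(15, 500))
```

The point of the rate form is that a team that throws 600 passes is not penalized against one that throws 400 simply for having more chances to be intercepted.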

Correlation Coefficient of Year-End Stats with Season Win Totals
Stat | Unadj Raw | Unadj VOLA | Adj Raw | Adj VOLA

PR = Punt Return, PC = Punt Coverage (yards per punt return), KR = Kick Return, KC = Kick Coverage (yards per kickoff), PFD = Penalty First Downs, PY = Penalty Yards

Using unadjusted sack, punt, kick, penalty, and turnover data along with adjusted rush, pass, and third down conversion data, the model has an average R² of 0.7995 and an average mean absolute error of 1.38 wins. The predicted win totals have a correlation coefficient of 0.83151 with the actual win totals.
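For readers who want to compute the same three fit measures on their own predictions, here is a minimal sketch using the textbook definitions. The win totals are made up, and I am assuming the usual definitions of R², mean absolute error, and Pearson correlation; the post does not say exactly how its averages were taken.

```python
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def mean_absolute_error(actual, predicted):
    """Average size of the miss, in wins."""
    return mean([abs(a - p) for a, p in zip(actual, predicted)])


def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    a_bar = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - a_bar) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    var_x = sum((a - x_bar) ** 2 for a in x)
    var_y = sum((b - y_bar) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Made-up actual and predicted win totals for five hypothetical teams.
actual_wins = [10, 6, 8, 12, 4]
predicted_wins = [9, 7, 8, 11, 5]

print(mean_absolute_error(actual_wins, predicted_wins))
print(r_squared(actual_wins, predicted_wins))
print(pearson_r(actual_wins, predicted_wins))
```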

Where to Go from Here/A Preview of Upcoming Work

  • Better special teams stat. I was thinking of using average starting field position after punts and kickoffs, which I can get from's game books.
  • Try a penalty rate stat. This would have to include all defensive plays as well. How often does a team make a mental mistake and how costly is it overall?
  • Try to adjust stats to account for home field advantage.
  • Try to adjust stats for conference.
  • Retry climate variables. Perhaps a ternary variable for each climate matchup: 0 = not applicable, 1 = applicable during weeks 1-8, 2 = applicable during weeks 9-17.
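The ternary climate variable from the last item could be encoded along these lines. This is a sketch only: the function name and the `matchup_applies` flag are hypothetical, and how a climate matchup is judged "applicable" is left open.

```python
def climate_flag(matchup_applies, week):
    """Ternary climate input for one climate matchup.

    0 = not applicable
    1 = applicable, weeks 1-8
    2 = applicable, weeks 9-17
    """
    if not 1 <= week <= 17:
        raise ValueError("week must be between 1 and 17")
    if not matchup_applies:
        return 0
    return 1 if week <= 8 else 2


# Hypothetical example: a warm-weather team visiting a cold-weather city.
print(climate_flag(True, 3))
print(climate_flag(True, 15))
print(climate_flag(False, 15))
```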

Numbers corrected on 7/12/2007 after finding errors in some box scores.


Brian Burke said...

My philosophy is to use the simplest, most basic, and representative stats possible for predicting games.

For example, consider 1st down rates. 1st down rates are simply a function of run yds/att and pass yds/att. It's not a perfect correlation, but the remaining variance is due to luck. And we don't want luck in our model.

The same holds true for things like red zone TDs or TD rates. It's a function of (mostly) pass efficiency and (to a lesser degree) run efficiency. If you include passing and running stats, AND red zone effectiveness in a model, you're confounding collinear stats. Retrospectively, your model fit will appear stronger because it will capture the results of luck, but your predictive capability would be reduced.

This is the mistake that most mathematical prediction models make. They take the "kitchen sink" approach.

By luck, I don't mean freak wind gusts or anything so supernatural. I'm talking about natural mathematical phenomena. My best example illustrates what I call "bunching."

Take a game between CLE and PIT where both teams are evenly matched. Both teams earn 12 first downs. CLE has 2 drives of 4 1st downs culminating in 2 TDs, and a few other drives resulting in punts. PIT has 6 drives consisting of 2 1st downs followed by punts. The final score is: CLE 14, PIT 0.

Both teams performed equally well, but CLE's 1st downs came in bunches. Baseball works roughly the same way, where teams with the same number of hits can have dramatically different run totals because the losing team's hits are "spread out."
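The CLE/PIT example can be written out as a toy calculation. The rule that a drive needs four first downs to score a touchdown is an assumption made only to match the example, and the split of CLE's remaining four first downs across its punting drives is made up.

```python
def score_drives(first_downs_per_drive, first_downs_for_td=4):
    """Toy scoring rule (assumed): a drive with enough first downs is a
    7-point touchdown; anything shorter ends in a punt."""
    touchdowns = sum(1 for fd in first_downs_per_drive
                     if fd >= first_downs_for_td)
    return 7 * touchdowns


# 12 first downs each, distributed differently across drives.
cle_drives = [4, 4, 2, 1, 1]       # two long drives, a few short ones
pit_drives = [2, 2, 2, 2, 2, 2]    # six short drives

print(sum(cle_drives), score_drives(cle_drives))
print(sum(pit_drives), score_drives(pit_drives))
```

Same first-down total, very different scores, purely because of how the first downs were bunched.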

I believe we have to accept that a certain percentage of football outcomes are due to luck. When I first set out to model the NFL, I thought a perfect model could get to an r-squared of 1.00. I was a deterministic fool. Now, I'm guessing more like 0.8 or so. We have to give up on determinism.

Here's some geeky history for you. One reason the Soviets lost the cold war was that their ideology was based on a deterministic future. To them, it was destiny for communism to sweep mankind.

Their ideology bled into their research. On a scientific level, they believed that with known initial conditions of all the particles in the universe, they could extrapolate the future. When quantum theory came along and blew that concept away, the commies resisted. An electron is not "there," it is "probably around there."

Since you're a computer science guy, you may already know that one of the most significant inventions of modern engineering is based on quantum theory--the microprocessor. That's why western military technology outpaced the Soviets' so quickly.

The point is, don't be so concerned about retrospective model fit. Worry about the purity of the independent variables of your model. Then accept that the rest is due to luck, or at least to things we can't effectively measure.

Derek said...

I've heard the "kitchen sink" problem also called "the curse of dimensionality." The more dimensions you introduce into a problem, the less space you cover in the state space. If your data covers 50% of the possible values in each dimension, then 25% of the space is covered in a 2-dimension problem. In a 4-dimension problem, only 6.25% of the space is covered.
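The arithmetic behind those coverage figures is just the per-dimension coverage raised to the number of dimensions. A quick sketch (the function name is mine, and it assumes the same coverage fraction in every dimension):

```python
def covered_fraction(per_dimension_coverage, dimensions):
    """Fraction of the state space covered when the data spans the same
    fraction of the possible values in every dimension."""
    return per_dimension_coverage ** dimensions


for d in (1, 2, 4, 8):
    print(d, covered_fraction(0.5, d))
```

With 50% coverage per dimension, two dimensions give 25% and four give 6.25%, matching the numbers above; each added input halves the coverage again.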

While I want to keep the problem as simple as possible, I do want to balance that with how many aspects of performance I capture. For instance, Football Outsiders has found that teams that perform better on third downs than on first and second downs usually improve the next year, and the converse holds as well. So is there some aspect of coaching that is being captured by the third-down conversion rate? Perhaps a team with poor run blocking but a fast running back will do more poorly on short-yardage third downs but better on other downs, as defenses will more likely be expecting the run. With "bunching" of hits, perhaps batting order plays a noticeable and consistent part in it. Of course, there's going to be some collinearity between rush and pass efficiencies and third-down conversion rates, but does it necessarily make for a worse model? It doesn't seem to. I assume that you tested this with your methods and found that it didn't work, however.

I agree that natural streakiness can swing several games a season for a team, so the R-square will never be 1. Clearly, however, it can be quite high. I wouldn't be surprised if it could exceed .9 or .95 with better stats.