Refining the Model Episode II - Attack of the Current Inputs
To bring things up to speed: the following table lists the current subset of possible inputs that have the best correlation with the final score margin and provide the best predictions.  If you are a careful enough reader, you'll notice some new inputs, which I'll explain in a second.  All of the inputs are expressed in terms of value over league average, which is a percentage.
Abbreviation key:
H = Home, A = Away, O = Offense, D = Defense, M = Made, A = Allowed, G = Given, T = Taken
R = Rush, P = Pass, SR = Sack Rate, 3C = Third down conversion rate, IR = Interception rate, FR = Fumble rate
v = versus (Home VOLA - Away VOLA)
U = Unadjusted, A = Adjusted for opponent quality (relative to league average)

Unadj/Adj  Input       Correlation with Margin
A          HROvARD      0.12565
A          AROvHRD     -0.070102
A          HPOvAPD      0.20396
A          APOvHPD     -0.17716
U          HSRMvASRA    0.12016
U          ASRMvHSRA   -0.12116
U          H3CMvA3CA    0.17253
U          A3CMvH3CA   -0.12753
U          HIRGvAIRT    0.076802
U          AIRGvHIRT   -0.13572
U          HFRGvAFRT    0.11939
U          AFRGvHFRT   -0.10935
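As a rough illustration of how one of these "v" inputs could be built, here is a minimal sketch. It assumes VOLA is the percentage difference from the league average, which is one reading of "value over league average, which is a percentage"; the function names and the sample numbers are hypothetical, not taken from the actual data set.

```python
def vola(team_value, league_average):
    """Value over league average, expressed as a percentage of that average.

    Assumed formula: 100 * (team - league) / league.
    """
    return 100.0 * (team_value - league_average) / league_average


def matchup_input(home_value, away_value, league_average):
    """A 'v' input: Home VOLA minus Away VOLA (e.g. HROvARD)."""
    return vola(home_value, league_average) - vola(away_value, league_average)


# Hypothetical yards-per-rush numbers with a league average of 4.0.
home_rush_offense = 4.4   # home offense gains 10% more than average
away_rush_defense = 3.8   # away defense allows 5% less than average
hro_v_ard = matchup_input(home_rush_offense, away_rush_defense, 4.0)
print(f"HROvARD = {hro_v_ard:.2f}")
```

The opponent-quality adjustment (the "A" rows) is not shown here, since the post does not spell out how it is computed.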
As you can see, the passing stats are now more highly correlated with the margin than the running stats.  NFL Stats pointed out that yards per attempt statistics are more significant than yards per game stats.  Also, a yards-per-pass attempt stat should include the yards lost in sacks.  I'll get to some of the correlations of my inputs with win totals at the end of the post, as they back up those assertions.
Without getting into gross detail, using VOLA instead of raw stats bumped up the correlation coefficients slightly but consistently, and adjusting for opponent quality significantly increased the correlation coefficients for rushing and passing inputs.  It's interesting to note that the home team's quality is more important (in terms of the correlation) than that of the away team.  In the turnover stats, the home team's ability to pick off passes is more important than their own QB's ability to not throw interceptions.  Similarly, their passing game and rushing game are more important than those of the away team.  This could be another effect of teams simply performing better at home and another justification for adjusting stats to account for home field advantage.  That article will come some time next week.
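For readers who want to reproduce this kind of comparison, the correlation of an input with the final margin can be sketched as a plain Pearson coefficient. This assumes the coefficients above are Pearson correlations, which the post does not state outright; the sample values are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-game values: one matchup input and the home team's margin.
input_values = [12.0, -5.0, 3.0, 8.0, -10.0]
score_margins = [7, -3, 10, -4, -14]
r = pearson(input_values, score_margins)
print(f"correlation with margin: {r:.4f}")
```

Running the same function once on raw stats and once on VOLA (or adjusted) stats is one way to check whether a transformation actually raises the correlation.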
Since May, I've made the following changes to the data set:

Correlation Coefficient of Year-End Stats with Season Win Totals

Stat  Unadj Raw   Unadj VOLA  Adj Raw     Adj VOLA
RO     0.2052      0.21076     0.20991     0.21529
RD    -0.13296     0.1418     -0.11469     0.12303
PO     0.5875      0.59369     0.61753     0.62381
PD     0.12777    -0.12764    -0.44051     0.44867
SRM    0.29453     0.30683     0.25853     0.27186
SRA   -0.35812     0.36049    -0.33423     0.33633
PR     0.14976     0.15024     0.12454     0.12407
PC    -0.10693     0.10282    -0.10179     0.09751
KR     0.080139    0.082233    0.029962    0.030679
KC    -0.051189    0.050584   -0.052803    0.053027
3CM    0.4948      0.50087     0.47661     0.48294
3CA   -0.33792     0.34373    -0.27612     0.2809
PFD   -0.090883    0.090432   -0.061634    0.060372
PY    -0.11465     0.1219     -0.10817     0.11399
IRG   -0.38033     0.38235    -0.36197     0.36454
IRT    0.34652     0.3513      0.28993     0.29535
FRG   -0.4307      0.43468    -0.37416     0.37877
FRT    0.34938     0.35141     0.32798     0.33012
PR = Punt Return, PC = Punt Coverage (yards per punt return), KR = Kick Return, KC = Kick Coverage (yards per kickoff), PFD = Penalty First Downs, PY = Penalty Yards
Using unadjusted sack, punt, kick, penalty, and turnover data along with adjusted rush, pass, and third down conversion data, the model has an average R² of 0.7995 and average mean absolute error of 1.38 wins.  The predicted win totals have a correlation coefficient of 0.83151 with the actual win totals.
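The two fit measures quoted above can be computed along these lines. This is only a sketch: it assumes the usual coefficient of determination (1 minus residual sum of squares over total sum of squares), and the win totals are invented for the example rather than taken from the model's output.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical season win totals for five teams.
actual_wins = [12, 4, 9, 7, 8]
predicted_wins = [11, 5, 9, 8, 7]

mae = mean_absolute_error(actual_wins, predicted_wins)
r2 = r_squared(actual_wins, predicted_wins)
print(f"MAE = {mae:.2f} wins, R^2 = {r2:.4f}")
```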
Where to Go from Here/A Preview of Upcoming Work
Numbers corrected on 7/12/2007 after finding errors in some box scores.
 
2 comments:
My philosophy is to use the simplest, most basic, and representative stats possible for predicting games.
For example, consider 1st down rates. 1st down rates are simply a function of run yds/att and pass yds/att. It's not a perfect correlation, but the remaining variance is due to luck. And we don't want luck in our model.
The same holds true for things like red zone TDs or TD rates. It's a function of (mostly) pass efficiency and (to a lesser degree) run efficiency. If you include passing and running stats, AND red zone effectiveness in a model, you're confounding collinear stats. Retrospectively, your model fit will appear stronger because it will capture the results of luck, but your predictive capability would be reduced.
This is the mistake that most mathematical prediction models make. They take the "kitchen sink" approach.
By luck, I don't mean freak wind gusts or anything so supernatural. I'm talking about natural mathematical phenomena. My best example illustrates what I call "bunching."
Take a game between CLE and PIT where both teams are evenly matched. Both teams earn 12 first downs. CLE has 2 drives of 4 1st downs culminating in 2 TDs, and a few other drives resulting in punts. PIT has 6 drives consisting of 2 1st downs followed by punts. The final score is: CLE 14, PIT 0.
Both teams performed equally well, but CLE's 1st downs came in bunches. Baseball works roughly the same way, where teams with the same number of hits can have dramatically different run totals because the losing team's hits are "spread out."
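The CLE/PIT example can be written out as a toy calculation. The scoring rule here (a drive with four or more first downs ends in a touchdown, anything shorter ends in a punt) is a deliberate oversimplification invented for this sketch, and the split of CLE's remaining first downs across its punting drives is assumed.

```python
def score(drives, first_downs_for_td=4, points_per_td=7):
    """Toy scoring rule: a drive scores a TD only if it strings together
    at least `first_downs_for_td` first downs; otherwise it ends in a punt."""
    return sum(points_per_td for fd in drives if fd >= first_downs_for_td)


# First downs per drive. Both teams total 12.
cle_drives = [4, 4, 2, 1, 1, 0]   # two long drives, the rest fizzle out
pit_drives = [2, 2, 2, 2, 2, 2]   # same total, spread evenly

print("first downs:", sum(cle_drives), sum(pit_drives))
print("score: CLE", score(cle_drives), "PIT", score(pit_drives))
```

Same first-down total, very different scoreboard: the only thing that changed is how the first downs were grouped.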
I believe we have to accept that a certain percentage of football outcomes are due to luck. When I first set out to model the NFL, I thought a perfect model could get to an r-squared of 1.00. I was a deterministic fool. Now, I'm guessing more like 0.8 or so. We have to give up on determinism.
Here's some geeky history for you. One reason the Soviets lost the Cold War was that their ideology was based on a deterministic future. To them, it was destiny for communism to sweep mankind.
Their ideology bled into their research. On a scientific level, they believed that with known initial conditions of all the particles in the universe, they could extrapolate the future. When quantum theory came along and blew that concept away, the commies resisted. An electron is not "there," it is "probably around there."
Since you're a computer science guy, you may already know that one of the most significant inventions of modern engineering is based on quantum theory--the microprocessor. That's why western military technology outpaced the Soviets' so quickly.
The point is, don't be so concerned about retrospective model fit. Worry about the purity of the independent variables of your model. Then accept that the rest is due to luck, or at least to things we can't effectively measure.
I've heard the "kitchen sink" problem also called "the curse of dimensionality." The more dimensions you introduce into a problem, the less space you cover in the state space. If your data covers 50% of the possible values in each dimension, then 25% of the space is covered in a 2-dimension problem. In a 4-dimension problem, only 6.25% of the space is covered.
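The arithmetic behind those percentages is just the per-dimension coverage raised to the number of dimensions, assuming the dimensions are covered independently. A quick sketch (the function name is made up):

```python
def covered_fraction(per_dimension_coverage, dimensions):
    """Fraction of the full state space covered when each dimension is
    covered independently to the same degree."""
    return per_dimension_coverage ** dimensions


# 2 and 4 dimensions match the figures above; 12 is the number of
# inputs in the margin table.
for d in (1, 2, 4, 12):
    print(f"{d:2d} dimensions: {100 * covered_fraction(0.5, d):.4f}% covered")
```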
While I want to keep the problem as simple as possible, I do want to balance that with how many aspects of performance I capture. For instance, Football Outsiders has found that teams that perform better on third downs than on first and second downs usually improve the next year, and the converse holds as well. So is there some aspect of coaching that is being captured by the third-down conversion rate? Perhaps a team with poor run blocking but a fast running back will do more poorly on short-yardage third downs but better on other downs, as defenses will more likely be expecting the run. With "bunching" of hits, perhaps batting order plays a noticeable and consistent part in it. Of course, there's going to be some collinearity between rush and pass efficiencies and third-down conversion rates, but does it necessarily make for a worse model? It doesn't seem to. I assume that you tested this with your methods and found that it didn't work, however.
I agree that natural streakiness can swing several games a season for a team, so the R-square will never be 1. Clearly, however, it can be quite high. I wouldn't be surprised if it could exceed .9 or .95 with better stats.