The Initial Research
The website built for the original research contains graphs, tables, and more complete descriptions.
Rather than predict the exact final score of games, which depends on a good many random factors, I tried to create a system that could predict the margin of victory or defeat for the home team. The idea is that the better team will win, and the better it is, the more points it will win by. Random factors still play a part in the final margin, but this at least reduces the number of possible outcomes to approximately 93: from 1994 to 2006, the maximum margin of victory was 49 points and the maximum margin of defeat was 43 points, which, counting a tie, gives 93 possible margins. Essentially, given what we know about the two teams, I'm trying to estimate the expected margin of victory or defeat: the average of the possible outcomes weighted by their probabilities.
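As a minimal sketch of the "expected margin" idea, the snippet below counts the possible margins and averages a small set of outcomes by probability. The function name and the example probabilities are made up for illustration; they are not results from the research.

```python
def expected_margin(outcome_probs):
    """Probability-weighted average of home-team margins.

    outcome_probs maps a margin (home points minus away points) to its
    estimated probability; the probabilities are normalized here in case
    they do not sum exactly to 1.
    """
    total = sum(outcome_probs.values())
    return sum(margin * p for margin, p in outcome_probs.items()) / total


# Margins from a 43-point home loss to a 49-point home win, including 0.
possible_margins = range(-43, 50)
print(len(possible_margins))

# Hypothetical probabilities for a handful of margins (illustration only).
example_probs = {-7: 0.20, -3: 0.10, 3: 0.30, 7: 0.25, 14: 0.15}
print(expected_margin(example_probs))
```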
Using box scores from this site, I gathered the statistics I would use as inputs: rush offense vs. rush defense (yards per game), pass offense vs. pass defense (YPG), punt return vs. return coverage (yards per return), sack rate made vs. sack rate allowed (sacks per pass play), time of possession (season average), and turnover ratio (season cumulative). Rather than look at each metric individually, I wanted each input to compare a team's offense with its opponent's defense, in keeping with an output that asks "how much better do I expect this team to be?" With Football Outsiders being a major inspiration for the research, I didn't want to use YPG statistics, but I was forced to. To capture some of what they did, I took the statistics (except time of possession and turnovers) and turned them into values over league average (VOLA).
Value over league average = (Team average - League average) / League average
For defensive measurements, where giving up fewer yards than league average is desirable, I simply multiplied VOLA by -1.
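The formula and the sign flip for defense can be sketched in a few lines. The function name, the `is_defense` flag, and the sample numbers are my own illustration, not taken from the original code.

```python
def vola(team_avg, league_avg, is_defense=False):
    """Value over league average, as a fraction of the league average.

    For defensive stats, allowing less than the league average is good,
    so the sign is flipped to keep "positive means better than average".
    """
    value = (team_avg - league_avg) / league_avg
    return -value if is_defense else value


# Hypothetical numbers: the league averages 100 rushing yards per game.
print(vola(120.0, 100.0))                   # offense gaining 120 YPG
print(vola(90.0, 100.0, is_defense=True))   # defense allowing 90 YPG
```

With the flip, both an offense that gains more than average and a defense that allows less than average come out positive.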
In a similar fashion, I tried adjusting for opponent quality, weighting a team's performance in each game by the opponent's performance over league average. If the opponents gain 150 yards per game rushing and the league average is 100 yards, holding them to 150 yards rushing is an average performance rather than below average. Each yard allowed is worth only 100/150 = 2/3 of a yard. Thus...
Adjusted VOLA = (Team's adjusted average - League average) / League average
If the equations are just stating the obvious, my apologies. To compare home team rush offense with away team rush defense, I merely subtracted the away team's rush defense AVOLA from the home team's rush offense VOLA. I did the same for the converse (away team offense vs. home team defense) and for pass offense, pass protection, and punt returns.
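Here is one way to read the opponent adjustment and the matchup subtraction described above. The per-game weighting (league average divided by the opponent's season average) follows the 100/150 example; the function names and the data layout are assumptions made for this sketch.

```python
def adjusted_average(games, league_avg):
    """Average per-game yards after scaling each game by opponent quality.

    games is a list of (yards, opponent_season_avg) pairs. Each yard against
    an opponent that averages 150 when the league averages 100 counts as
    100/150 of a yard.
    """
    adjusted = [yards * league_avg / opp_avg for yards, opp_avg in games]
    return sum(adjusted) / len(adjusted)


def avola(games, league_avg, is_defense=False):
    """Adjusted value over league average; sign flipped for defense."""
    value = (adjusted_average(games, league_avg) - league_avg) / league_avg
    return -value if is_defense else value


def matchup(home_offense_value, away_defense_value):
    """Home offense rating minus away defense rating for one input."""
    return home_offense_value - away_defense_value


# Holding a 150 YPG rushing team to 150 yards is an average performance.
print(avola([(150.0, 150.0)], 100.0, is_defense=True))

# Hypothetical two-game defense: 150 allowed vs. a 150 YPG team,
# 80 allowed vs. a 100 YPG team.
away_defense = avola([(150.0, 150.0), (80.0, 100.0)], 100.0, is_defense=True)
print(matchup(0.2, away_defense))
```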
In addition to the statistics, I also used home field climate as an input. Based on Football Outsiders' work, I divided home field climate into 4 types: warm, cold, dome, and Denver (high altitude). For each team, there are 4 binary (0 or 1/true or false) variables representing the 4 possible climate types.
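The four binary climate variables amount to a one-hot encoding. A small sketch follows; the label strings and function names are assumptions, and the ordering of the variables is arbitrary.

```python
CLIMATES = ("warm", "cold", "dome", "denver")


def encode_climate(climate):
    """Return four 0/1 values, exactly one of which is 1."""
    if climate not in CLIMATES:
        raise ValueError(f"unknown climate type: {climate!r}")
    return [1 if climate == c else 0 for c in CLIMATES]


def game_climate_features(home_climate, away_climate):
    """Four binary inputs per team, home team first: eight per game."""
    return encode_climate(home_climate) + encode_climate(away_climate)


# A warm-weather team visiting a cold-weather team.
print(game_climate_features("cold", "warm"))
```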
So we have all of these statistics. What do we do with them? Well, as I was studying artificial intelligence when I did this, I went for methods such as artificial neural networks and support vector machines. The methods were tested on the years 2000-2006, one year at a time, training each time on all years before the test year. I also compared my results against the spread, which similarly represents the expected margin of victory or defeat.
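The year-by-year testing scheme can be sketched as a generator of train/test splits. I'm assuming here that the data starts in 1994, the first season mentioned earlier; the function name is hypothetical.

```python
def walk_forward_splits(first_year, test_years):
    """Yield (training_years, test_year) pairs.

    Each test year is predicted by a model trained only on the seasons
    that came before it, so no future games leak into training.
    """
    for test_year in test_years:
        yield list(range(first_year, test_year)), test_year


# Assumed first season of data: 1994. Test seasons: 2000 through 2006.
splits = list(walk_forward_splits(1994, range(2000, 2007)))
for train_years, test_year in splits:
    print(f"train {train_years[0]}-{train_years[-1]}, test {test_year}")
```

The training set grows by one season with each split, so the 2006 model sees twice as many seasons as the 2000 model.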
All methods, including the spread, had the following weaknesses:
The methods I used ran into the following problems:
Other highly correlated inputs were the turnover ratios and the sack rate metrics.
Specific numbers are given here, but overall, the spread was clearly the best predictor in terms of the following metrics:
Using the spread as an extra input, I could get better results in one or two areas in most years, but the spread was clearly carrying the load. Support vector machines with the spread did the best overall, giving stable predictions (unlike neural networks), but they're not easily interpretable models. Linear regression without the spread was 57-63% accurate, which was on the lower end of performance, but it's a model that is easily interpretable.
Where to go from here
For the sake of interpretability, I'm going to experiment mainly with linear regression. The extra accuracy of the other methods isn't worth the loss of interpretability at this point.
If the spread's the best predictor, then I think trying to model the spread could yield some knowledge that leads toward better predictions. It's a simple matter of replacing the final score margin with the spread as the output I'm trying to predict. I'll be following up on this soon.
The bias towards the home team in all of the methods needs to be taken down a notch. In the case of linear regression, 3.2-3.6 points were being added in favor of the home team automatically (via bias term), which is up to a point more than what should be added on average. Rather than using the bias linear regression comes back with, I could use the actual average result for that year, so in 2006, when the average result was 0.84 points in favor of the home team, an extra 2 points wouldn't be added. The bias could also be adjusted for the home field climate type, eliminating the need for the binary variables.
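A rough sketch of the bias swap: keep the fitted weights but replace the regression's bias term with the observed average home margin. The weights and feature values below are invented for illustration; only the 3.2 and 0.84 figures come from the text above.

```python
def predict_margin(weights, features, home_bias):
    """Linear prediction: home-field bias plus the weighted inputs."""
    return home_bias + sum(w * x for w, x in zip(weights, features))


# Hypothetical weights and matchup inputs (not fitted values).
weights = [10.0, 8.0]
features = [0.05, -0.02]

fitted = predict_margin(weights, features, home_bias=3.2)     # regression bias
adjusted = predict_margin(weights, features, home_bias=0.84)  # 2006 average
print(fitted, adjusted, fitted - adjusted)
```

Because the bias is a constant, the swap shifts every prediction by the same amount and leaves the ordering of games by predicted margin unchanged.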
Most importantly, better inputs are needed. All of the computing power in the world isn't going to help otherwise. From the box scores, I can also include kickoff returns and third down conversion rates, and I'm going to follow up on that soon. Beyond that, I think something like Football Outsiders' DVOA statistics is necessary. The DVOA stats take the context of every play into account, filter out random noise, are adjusted for opponent quality, and break down into very specific situations and into specific players and groups of players. DVOA for the pass defense actually measures the quality of the pass defense, unlike the YPG stats.
As long as this article is, I glossed over a good deal in terms of specific results. To put it briefly, we could do better. In a way, I spent several months discovering what I pretty much knew already: YPG stats aren't very valuable, and warm-weather teams have trouble in cold weather. But it was nevertheless interesting to quantify things like home-field advantage (more on this coming). The research is going to need time to evolve, and my resources are limited. Can't be afraid to fail.
2 comments:
Derek,
Admittedly I know very little about neural networks or support vector machines. And by 'very little,' I mean nothing.
But I do know that one of the requirements for valid linear regression is a normally distributed continuous dependent variable.
Real game point spreads are neither continuous nor normal. I can easily accept the continuous shortcoming, that's pretty common.
But the distribution of point spreads is very far from normal. There are lots more 3, 7, 10, or 13 point spreads than 4, 5, 8 point spreads because of the way football is scored.
Very good point about the normal distribution. I might have posted this in a later entry, but in terms of the standard deviation, the spreads actually follow the 68-95-99.7 rule. Of course, this does not mean that they're normally distributed. Linear regression works well on many problems because many problems' true distributions are similar to (though not actually) normal. With about 95 actual outcomes/discrete classes (many of which have few examples), I just thought it was easier to treat the margin as a continuous variable.
The point of using NNs and SVMs was to introduce some non-linearity to the approximated function. You could certainly just use non-linear variables in a linear or logistic regression instead, which you've tried.