Friday, June 8, 2007

The Initial Research

The website built for the original research. Contains graphs, tables, and more complete descriptions.

Rather than predict the exact final score of games, which depends on a good many random factors, I tried to create a system that could predict the margin of victory/defeat for the home team. The idea is that the better team will win, and the better they are, the more points they'll win by. Random factors still play a part in the final score margin, but this at least reduces the number of possible outcomes to approximately 93: from 1994-2006, the maximum margin of victory was 49 points, and the maximum margin of defeat was 43 points. Essentially, given what we know about the two teams, I'm trying to find the expected margin of victory/defeat, that is, the average of the possible outcomes weighted by their probability.
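To make "expected margin" concrete, here is a minimal Python sketch. The function name and the three-outcome distribution are made up for illustration; they are not from the original research.

```python
def expected_margin(distribution):
    """Probability-weighted average of home-team margins.

    distribution maps a margin (positive = home win) to its probability.
    """
    # Normalize in case the probabilities do not sum to exactly 1.
    total = sum(distribution.values())
    return sum(margin * p for margin, p in distribution.items()) / total


# Margins seen from 1994-2006 ran from -43 (home loss) to 49 (home win).
possible_margins = range(-43, 50)
print(len(possible_margins))  # 93

# Hypothetical toy distribution over three outcomes (illustration only).
toy = {-3: 0.25, 3: 0.5, 7: 0.25}
print(expected_margin(toy))  # 2.5
```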

Using box scores from this site, I gathered the statistics I would use as inputs: rush offense vs. defense (yards per game), pass offense vs. defense (yards per game, or YPG), punt return vs. return coverage (yards per return), sack rate made vs. sack rate allowed (sacks per pass play), time of possession (TOP, season average), and turnover ratio (season cumulative). Rather than look at each metric individually, I wanted each input to compare a team's offense with its opponent's defense, in keeping with an output of "how much better do I expect this team to be." Because Football Outsiders was a major inspiration for the research, I didn't want to use YPG statistics, but I was forced to. To capture some of what they did, I took the statistics (except TOP and turnovers) and turned them into values over league average (VOLA).

Value over league average = (Team average - League average) / League average

For defensive measurements, where giving up fewer yards than league average is desirable, I simply multiplied VOLA by -1.
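As a quick sketch of the formula and the sign flip for defense (the function name and the yardage numbers are hypothetical, chosen only to make the arithmetic easy to follow):

```python
def vola(team_avg, league_avg, defense=False):
    """Value over league average, as a fraction of the league average.

    For defensive stats, allowing less than the league average is good,
    so the sign is flipped to keep "positive = better than average".
    """
    value = (team_avg - league_avg) / league_avg
    return -value if defense else value


# Hypothetical numbers: league average of 100 rushing yards per game.
rush_offense = vola(120.0, 100.0)                # gains 20% more than average
rush_defense = vola(90.0, 100.0, defense=True)   # allows 10% less than average
print(rush_offense, rush_defense)
```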

In a similar fashion, I tried adjusting for opponent quality, weighting a team's performance in each game by the opponent's performance over league average. If the opponents gain 150 yards per game rushing and the league average is 100 yards, holding them to 150 yards rushing is an average performance rather than below average. Each yard allowed is worth only 100/150 = 2/3 of a yard. Thus...

Adjusted VOLA = (Team's adjusted average - League average) / League average
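Here is one way the opponent adjustment could be sketched in Python. It assumes each game's yardage is scaled by league average / opponent's season average (the 100/150 factor in the example above) and that the scaled games are then simply averaged. The function names, the data layout, and the two-game sample are my illustration, not the exact code used in the research.

```python
def adjusted_average(games, league_avg):
    """Average per-game yardage after scaling for opponent quality.

    games is a list of (yards, opponent_season_avg) pairs. Each game's
    yards are scaled by league_avg / opponent_season_avg, so yards given
    up to a strong rushing team count for less than face value.
    """
    scaled = [yards * league_avg / opp_avg for yards, opp_avg in games]
    return sum(scaled) / len(scaled)


def avola(games, league_avg, defense=False):
    """Adjusted value over league average (sign flipped for defense)."""
    value = (adjusted_average(games, league_avg) - league_avg) / league_avg
    return -value if defense else value


# Hypothetical defense: allowed 150 yards to a 150-YPG rushing team
# (counts as 100) and 80 yards to a league-average 100-YPG team.
games = [(150.0, 150.0), (80.0, 100.0)]
print(adjusted_average(games, 100.0))      # 90.0
print(avola(games, 100.0, defense=True))   # 10% better than average
```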

If the equations are just stating the obvious, my apologies. To compare home team rush offense with away team rush defense, I merely subtracted the away team's rush defense adjusted VOLA (AVOLA) from the home team's rush offense VOLA. I did the same for the converse (away team offense vs. home team defense) and for pass offense, pass protection, and punt returns.

In addition to the statistics, I also used home field climate as an input. Based on Football Outsiders' work, I divided home field climate into 4 types: warm, cold, dome, and Denver (high altitude). For each team, there are 4 binary (0 or 1/true or false) variables representing the 4 possible climate types.
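A minimal sketch of that encoding follows. The ordering of the four flags and the function names are arbitrary choices for illustration.

```python
CLIMATES = ("warm", "cold", "dome", "denver")


def climate_flags(climate):
    """Return four 0/1 flags, exactly one of which is 1."""
    if climate not in CLIMATES:
        raise ValueError(f"unknown climate type: {climate!r}")
    return [1 if climate == c else 0 for c in CLIMATES]


def game_climate_inputs(home_climate, away_climate):
    """Four flags for the home team followed by four for the away team."""
    return climate_flags(home_climate) + climate_flags(away_climate)


# A cold-weather team hosting a dome team.
print(game_climate_inputs("cold", "dome"))  # [0, 1, 0, 0, 0, 0, 1, 0]
```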

So we have all of these statistics. What do we do with them? Well, as I was studying artificial intelligence when I did this, I went for methods such as artificial neural networks and support vector machines. The methods were tested on the years 2000-2006, one at a time, and training on all years previous to the test set year. I also tried comparing my results against the spread, which similarly represents the expected margin of victory/defeat.
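The train/test scheme described above (test on one year, train on every earlier year) can be sketched like this. The function name is hypothetical, and the season list assumes the 1994-2006 data mentioned earlier.

```python
def walk_forward_splits(seasons, first_test=2000, last_test=2006):
    """Yield (training_seasons, test_season) pairs, one per test year.

    Each test season is paired with all seasons strictly before it,
    so a model never sees data from the year it is tested on.
    """
    seasons = sorted(seasons)
    for test_year in range(first_test, last_test + 1):
        if test_year not in seasons:
            continue
        training = [year for year in seasons if year < test_year]
        if training:
            yield training, test_year


for training, test_year in walk_forward_splits(range(1994, 2007)):
    print(f"train {training[0]}-{training[-1]}, test {test_year}")
```

The first split trains on 1994-1999 and tests on 2000; the last trains on 1994-2005 and tests on 2006.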

All methods, including the spread, had the following weaknesses:


  1. 10 point barrier: Predictions were consistently off by at least 10 points on average from year to year. Only a couple of methods in a couple of years could get down to a mean absolute error of 9.7-9.8 points. Interestingly, games in 1994-2006 were won by 11.329 points on average.
  2. Too many games classified as wins for home team: 58.51% of games were won by the home team from 1994-2006. Methods would regularly classify 65-80% of games as home team wins.
  3. Small range of predictions: The predictions usually ranged from about -10 to about 15. The actual range of outcomes is -43 to 49. 27.681% of games from 1994 to 2006 were won by more than 15 points. My guess is that this inflated the error, though I haven't checked this out for sure. As soon as I typed this, I put it on the to-do list, however. If you graph the actual outcomes vs. the predicted outcomes, it ends up looking like a parallelogram as seen below.
  4. Sensitive to yearly variations in home-field advantage: The average outcome of an NFL game from 1994 to 2006 was the home team winning by 2.63 points. 58.51% of those games were won by the home team. In 2006, however, only 53.125% of games were won by the home team, and the average result was the home team winning by 0.84 points. As a result, the performance of all methods, including the spread, was aberrantly poor. In 2005, the average result was the home team winning by 3.677 points, and the home team won 59.14% of games. As a result, the performance of all methods was aberrantly good.




The methods I used ran into the following problems:

  1. Some predictions close to zero: About 10-15% of the predictions were that the game would be won by less than one point. Furthermore, about half (5-8%) of those predictions were that the game would be won by less than half a point. As long as they're still not zero, they can still be used for win/loss prediction, but it just doesn't look good to say "Team A is predicted to win by 0.15 points." I'd be curious to see what the win/loss accuracy of these predictions is.
  2. Input stats have very low correlation to margin of victory/defeat: The problem with YPG stats is that they don't necessarily represent quality. Teams with comfortable leads run the ball more to eat up the clock, so the rushing metrics have the highest correlation to the margin. Correlation does not equal causation, however. It's possible the team used the passing game to rack up points quickly, and the defense took care of the rest. Conversely, teams that are behind will abandon the running game in favor of the passing game, which covers more ground in less time. Thus, the passing metrics have a low correlation to the margin.

    Other highly correlated inputs were the turnover ratios and the sack rate metrics.


Specific numbers are given here, but overall, the spread was clearly the best predictor in terms of the following metrics:


  • Win/loss accuracy
  • Average error (how many points off was it?)
  • Correlation of predictions with actual result
  • Proportion of games classified as home team wins.
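For reference, the four metrics can be computed from a list of predicted margins and a list of actual margins roughly as follows. This is a sketch, not the code behind the reported numbers: the function names are mine, the sample margins are invented, and tie games get no special handling.

```python
import math


def sign(x):
    """Return 1 for a home win, -1 for a home loss, 0 for a tie."""
    return (x > 0) - (x < 0)


def evaluate(predicted, actual):
    """Score predicted home-team margins against actual margins."""
    n = len(actual)
    pairs = list(zip(predicted, actual))
    # Win/loss accuracy: the predicted winner matches the actual winner.
    accuracy = sum(sign(p) == sign(a) for p, a in pairs) / n
    # Average error: how many points off was the prediction?
    mean_abs_error = sum(abs(p - a) for p, a in pairs) / n
    # Pearson correlation (undefined if either list is constant).
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in pairs)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    correlation = cov / math.sqrt(var_p * var_a)
    # Proportion of games classified as home team wins.
    home_win_share = sum(p > 0 for p in predicted) / n
    return {
        "accuracy": accuracy,
        "mean_abs_error": mean_abs_error,
        "correlation": correlation,
        "home_win_share": home_win_share,
    }


# Invented sample: four games, margins from the home team's point of view.
result = evaluate([3.0, -2.0, 7.0, 1.0], [10, -3, -7, 4])
print(result["accuracy"], result["mean_abs_error"], result["home_win_share"])
# 0.75 6.25 0.75
```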


Using the spread as an extra input, I could get better results in one or two areas in most years, but the spread was clearly carrying the load. Support vector machines with the spread did the best overall, giving stable predictions (unlike neural networks), but they're not easily interpretable models. Linear regression without the spread was 57-63% accurate, which was on the lower end of performance, but it's a model that is easily interpretable.

Where to go from here
For interpretability reasons, I'm going to be mainly experimenting with linear regression. The extra accuracy of the less interpretable methods isn't worth the tradeoff at this point.

If the spread's the best predictor, then I think trying to model the spread could yield some knowledge that leads toward better predictions. It's a simple matter of replacing the final score margin with the spread as the output I'm trying to predict. I'll be following up on this soon.

The bias towards the home team in all of the methods needs to be taken down a notch. In the case of linear regression, 3.2-3.6 points were being added in favor of the home team automatically (via bias term), which is up to a point more than what should be added on average. Rather than using the bias linear regression comes back with, I could use the actual average result for that year, so in 2006, when the average result was 0.84 points in favor of the home team, an extra 2 points wouldn't be added. The bias could also be adjusted for the home field climate type, eliminating the need for the binary variables.
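The bias swap could look something like the sketch below. The weights and inputs are placeholders, not fitted values; 3.4 is just a point inside the 3.2-3.6 range mentioned above, and 0.84 is the 2006 average.

```python
def predict_margin(weights, inputs, home_bias):
    """Linear prediction of the home-team margin.

    home_bias is the constant added in favor of the home team. It can be
    the regression's own bias term or a value supplied from elsewhere,
    such as the average home margin for the season.
    """
    return home_bias + sum(w * x for w, x in zip(weights, inputs))


# Placeholder weights and matchup inputs (illustration only).
weights = [10.0, 5.0]
inputs = [0.1, -0.05]

with_fitted_bias = predict_margin(weights, inputs, home_bias=3.4)
with_2006_average = predict_margin(weights, inputs, home_bias=0.84)

# The two predictions differ only by the change in the bias.
print(with_fitted_bias - with_2006_average)
```

One caveat with this shortcut: a fitted bias only equals the average margin when the inputs average out to zero, so the swap is an approximation unless the inputs are centered.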

Most importantly, better inputs are needed. All of the computing power in the world isn't going to help otherwise. From the box scores, I can also include kickoff returns and third down conversion rates, and I'm going to follow up on that soon. Beyond that, I think something like Football Outsiders' DVOA statistics is necessary. The DVOA stats take the context of every play into account, filter out random noise, are adjusted for opponent quality, and break down into very specific situations and into specific players and groups of players. DVOA for the pass defense actually measures the quality of the pass defense, unlike the YPG stats.

As long as this article is, I glossed over a good deal in terms of specific results. To put it shortly, we could do better. In a way, I spent several months discovering what I pretty much knew already: YPG stats aren't very valuable, and warm-weather teams have trouble in cold weather. But it was nevertheless interesting to quantify things like home-field advantage (more on this coming). The research is going to need time to evolve, and my resources are limited. Can't be afraid to fail.

2 comments:

Brian Burke said...

Derek,

Admittedly I know very little about neural networks or support vector machines. And by 'very little,' I mean nothing.

But I do know that one of the requirements for valid linear regression is a normally distributed continuous dependent variable.

Real game point spreads are neither continuous nor normal. I can easily accept the continuous shortcoming, that's pretty common.

But the distribution of point spreads is very far from normal. There are lots more 3, 7, 10, or 13 point spreads than 4, 5, 8 point spreads because of the way football is scored.

Derek said...

Very good point about the normal distribution. I might have posted this in a later entry, but in terms of the standard deviation, the spreads actually follow the 68-95-99.7 rule. Of course, this does not mean that they're normally distributed. Linear regression works well on many problems because many problems' true distributions are similar to (though not actually) normal. Though with about 95 actual outcomes/discrete classes (many of which have few examples), I just thought it was easier to treat the margin like a continuous variable.

The point of using NNs and SVMs was to introduce some non-linearity to the approximated function. You could certainly just use non-linear variables in a linear or logistic regression instead, which you've tried.
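For anyone who wants to run the 68-95-99.7 check mentioned above on their own data, a rough sketch follows. The function name and the five sample values are made up; real input would be the full list of spreads or margins.

```python
import statistics


def share_within_sd(values, k):
    """Fraction of values within k standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return sum(abs(v - mean) <= k * sd for v in values) / len(values)


# Made-up sample, far too small to say anything about normality.
sample = [-7, -3, 0, 3, 7]
for k in (1, 2, 3):
    print(k, share_within_sd(sample, k))
```

Shares near 0.68, 0.95, and 0.997 are consistent with a normal distribution but, as noted, do not prove it.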