Monday, September 3, 2007

Theory on Best Possible Accuracy in Prediction of Games

I've spoken before about how predictions based on mean performance will always be flawed, because teams almost always perform at a level above or below the mean. The problem is that, without advanced statistical number-crunching, the mean is our best guess at what the performance in the next game will be. I have an idea for how one might estimate a probability distribution over a range of possible performance levels, but I want to look into better estimation methods before running that experiment. In the meantime, I was curious about the following: if our estimates were so good that we knew with 100% certainty that team X in game g would put up stat line S(X,g), how good would our game predictions be? In other words, if I told you the home team would average 4.3 yards a carry and 7.6 yards per pass, etc., along with the away team's stats, how accurately could you guess the winner of that game?

This should be simple enough to figure out. We take the stats straight from the box scores (rushing average, passing average, 3rd down efficiency, everything I use in the predictive models) and plug them into a regression model: linear regression for the final score margin, or logistic regression for classifying the game as a home-team win or loss. Keep in mind, I'm avoiding total rushing and passing yards because I don't think they'd be as useful here. There'd be a wider margin of error in estimating a probability function for the exact number of rushing/passing yards gained or allowed in a game, since those totals depend on how far a team gets ahead and other clock-related issues. Using data from the 1996-2006 seasons, I tested on each season after 1996 separately, training on all the seasons before it.
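For anyone who wants the mechanics, here's a rough sketch of that procedure in Python. The scikit-learn calls are just my choice for illustration, and the variable names and data layout (X, y_margin, y_win, seasons) are placeholders, not the actual dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Placeholder data layout: one row per game.
#   X        - per-game box score stats for both teams (rushing average,
#              passing average, 3rd down efficiency, etc.)
#   y_margin - final score margin, home minus away
#   y_win    - 1 if the home team won, 0 otherwise
#   seasons  - the season each game belongs to (1996-2006)

def walk_forward_eval(X, y_margin, y_win, seasons):
    """Test on each season after 1996, training on all earlier seasons."""
    results = {}
    for test_year in range(1997, 2007):
        train = seasons < test_year
        test = seasons == test_year

        # Linear regression on the score margin; a predicted margin > 0
        # counts as a predicted home-team win.
        lin = LinearRegression().fit(X[train], y_margin[train])
        margin_pred = lin.predict(X[test])
        lin_acc = np.mean((margin_pred > 0) == (y_win[test] == 1))

        # Logistic regression classifies home win/loss directly.
        log = LogisticRegression(max_iter=1000).fit(X[train], y_win[train])
        log_acc = log.score(X[test], y_win[test])

        results[test_year] = {"lin_acc": lin_acc, "log_acc": log_acc}
    return results
```

Each test season only sees models trained on earlier seasons, so the accuracies below aren't inflated by fitting to the very games being predicted.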

The average win-loss classification accuracy was 84.351% for linear regression and 83.722% for logistic regression. That is a very big jump from what you get using means (roughly the 60-69% range). The average mean absolute error in predicting scoring margins was surprisingly high at 6.1798 points, though the estimated margins have a 0.84564 correlation with the actual margins. So even if you knew the exact stat lines, you'd still be off by about a touchdown on average! With this model, the problem of classifying too many games as home-team wins almost disappears: only 59.137% (linear) / 59.623% (logistic) of games are predicted to be won by the home team, as opposed to the actual 58.84%.
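For reference, those summary numbers come from simple comparisons like the following (again a sketch; margin_pred and margin_actual are placeholders for one test season's predicted and actual home margins):

```python
import numpy as np

# margin_pred, margin_actual: predicted and actual home-minus-away
# score margins for one test season (placeholder arrays)
mae = np.mean(np.abs(margin_pred - margin_actual))            # average miss, in points
corr = np.corrcoef(margin_pred, margin_actual)[0, 1]          # correlation with actual margins
home_rate = np.mean(margin_pred > 0)                          # share of games picked as home wins
accuracy = np.mean((margin_pred > 0) == (margin_actual > 0))  # win-loss accuracy
```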

As has been documented several times, 2006 was a fluky year. An abnormally low percentage of home teams won (55.56%), largely because of how interconference games turned out. While all the other models suffered in accuracy and classified far too many games as home-team wins, both regression models here classified 55.469% of games as won by the home team. Almost an exact match! 2006 still yielded the lowest accuracy (82.031% linear / 82.813% logistic), but the dropoff was small. So maybe 2006 wasn't quite as fluky as I thought, and our methods of evaluating teams, specifically in interconference matchups, need to change.

Of course, this is all based on 20/20 hindsight, but it gives a good idea of what could be achieved with prediction models. 82-85% accuracy would be achievable if you knew the stat lines beforehand, but any estimate of single-game performance will come with a good deal of error. The mean is a simple and efficient estimate, but big advances should be possible with more sophisticated performance projections. Brian Burke's estimate of 76% accuracy seems reasonable, but I wonder if the ceiling can't be pushed a little higher, to ~80%.
