Monday, October 29, 2007

The Key to Good Predictions (though not yet a solution)

(Credit goes to Aaron Hermann of for bringing the central fact of the article to my attention.)

As previously discussed here, averages aren't very good for predictions because players and teams perform above and below their averages. That's how you get the averages in the first place. "We" here at the blog use a system that classifies games as home wins or losses based on the stats from the game. It is 85% accurate, with a large share of the misclassifications due to recovered fumbles. But it needs almost 15 individual stats to get that accurate. Are there one or two stats that we could try predicting and still get close to the same accuracy?

Yes. As it turns out (unsurprisingly), the key statistic is yards per pass. Including sacks and sack yards lost, the team with the higher pass efficiency wins ~70% of games. Here's the year-by-year breakdown:

Year | % of games home team has higher pass eff. | Correlation of (Home yds/pass - Away yds/pass) with (Home pts - Away pts) | % of games won by team with higher pass eff.

So not only does the more pass-efficient team usually win, but the correlation shows that the bigger the efficiency advantage, the higher the margin of victory tends to be. The home team is the more efficient passing team only 53-54% of the time, showing that playing at home gives a slight edge in actual performance, but it's a smaller edge than the actual results might indicate (~58% of games are won by the home team).
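For anyone who wants to reproduce that kind of breakdown on their own data, here is a minimal sketch in Python. It assumes "yards per pass including sacks" means (pass yards - sack yards lost) / (pass attempts + sacks), which is the usual net-yards definition but not necessarily the exact formula used for the numbers above. The function names, the record layout, and the four sample games are all made up for illustration.

```python
from math import sqrt

def net_yards_per_pass(pass_yards, sack_yards_lost, attempts, sacks):
    """Yards per pass play, counting sacks as plays and subtracting sack yardage.
    Assumed definition; the post does not spell out the exact formula."""
    return (pass_yards - sack_yards_lost) / (attempts + sacks)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def summarize(games):
    """Compute the three summary columns for one season's worth of games."""
    eff_diffs, pt_diffs = [], []
    home_more_efficient = 0
    efficient_team_won = 0
    for g in games:
        home_eff = net_yards_per_pass(*g["home_pass"])
        away_eff = net_yards_per_pass(*g["away_pass"])
        eff_diff = home_eff - away_eff
        pt_diff = g["home_pts"] - g["away_pts"]
        eff_diffs.append(eff_diff)
        pt_diffs.append(pt_diff)
        home_more_efficient += int(eff_diff > 0)
        # Same sign means the more efficient team won; ties count as misses.
        efficient_team_won += int(eff_diff * pt_diff > 0)
    n = len(games)
    return {
        "home_higher_eff_share": home_more_efficient / n,
        "eff_vs_points_correlation": pearson(eff_diffs, pt_diffs),
        "higher_eff_team_win_share": efficient_team_won / n,
    }

# Made-up games: (pass yards, sack yards lost, attempts, sacks) per team.
SAMPLE_GAMES = [
    {"home_pass": (250, 10, 30, 2), "away_pass": (180, 20, 35, 3), "home_pts": 27, "away_pts": 17},
    {"home_pass": (150, 15, 28, 2), "away_pass": (260, 0, 32, 0), "home_pts": 10, "away_pts": 24},
    {"home_pass": (200, 0, 25, 0), "away_pass": (210, 10, 38, 2), "home_pts": 20, "away_pts": 23},
    {"home_pass": (190, 10, 34, 2), "away_pass": (230, 5, 29, 1), "home_pts": 13, "away_pts": 16},
]

if __name__ == "__main__":
    print(summarize(SAMPLE_GAMES))
```

Running `summarize` once per season would give one row per year. Counting ties as misses is a simplification; a real dataset might drop them instead.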

Using unadjusted stats translated into a percentage over the league average (VOLA), I tried a logistic regression model based on home and away passing efficiency, sack rates, and interception rates on offense and defense to predict (Home yards per pass - Away yards per pass). Training on 1996-2005 and testing on 2006, the model was only 58% accurate in picking the team with the higher pass efficiency, while picking only 46% of home teams to be more pass efficient. On average, the absolute error was a whopping 2.11 yards per pass. Given that each team runs roughly 30 pass plays, we're off the mark by about 63 total pass yards/game.
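To make those evaluation numbers concrete, here is a rough sketch of how they could be computed. It is not the model itself. The `vola` function assumes "percentage over the league average" means the percent difference from the league mean, which is my reading rather than a documented definition, and `evaluate`, its argument names, and its result keys are hypothetical.

```python
def vola(value, league_average):
    """Percent above (+) or below (-) the league average.
    Assumed interpretation of VOLA; the exact definition is not given in the post."""
    return 100.0 * (value - league_average) / league_average

def evaluate(predicted_diffs, actual_diffs, pass_plays_per_team=30):
    """Score predictions of (home yds/pass - away yds/pass) against actual values."""
    n = len(actual_diffs)
    pairs = list(zip(predicted_diffs, actual_diffs))
    # A pick is correct when prediction and reality agree on who was more efficient.
    correct_picks = sum(1 for p, a in pairs if p * a > 0)
    home_picks = sum(1 for p in predicted_diffs if p > 0)
    mean_abs_error = sum(abs(p - a) for p, a in pairs) / n
    return {
        "pick_accuracy": correct_picks / n,
        "home_pick_share": home_picks / n,
        "mean_abs_error": mean_abs_error,
        # Per-play error scaled up by an assumed number of pass plays per team.
        "yards_per_game_error": mean_abs_error * pass_plays_per_team,
    }

if __name__ == "__main__":
    # Made-up predictions and outcomes, in yards per pass.
    print(evaluate([1.0, -0.5, 0.5, -2.0], [2.0, 1.0, -1.0, -1.0]))
```

The last key is the same back-of-the-envelope scaling used in the text: 2.11 yards per pass times 30 pass plays is about 63 yards.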

Predictions are only as good as your inputs. You can throw all the complicated algorithms in the world at a problem, but if the data's no good, your results won't be any good. So this is where I'm going to focus most of my creative efforts, and I hope to have a different prediction model for the 2008 season. This might require a complete redo of the data I work with, but like the Yankees and the Dolphins, if you're in "we're one piece away" mode for too long, you need to rebuild.
