To calculate the probability of football results, I have so far used the Poisson distribution for the number of home and away goals in a match. Depending on the difference in Elo points between two teams, the average number of goals scored and conceded were taken as parameters for two independent Poisson variables. Check this article for details.
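As a rough illustration of the original approach, the sketch below turns two independent Poisson means into win, draw and loss probabilities. The function names, the example means and the cut-off of 15 goals are assumptions made for this example, not values taken from the site.

```python
from math import exp, factorial


def poisson_pmf(k, mean):
    """Probability of exactly k goals for a Poisson variable with the given mean."""
    return exp(-mean) * mean ** k / factorial(k)


def outcome_probabilities(home_mean, away_mean, max_goals=15):
    """Return (home win, draw, away win) for two independent Poisson goal counts.

    max_goals is an assumed truncation point; the probability mass beyond it
    is negligible for realistic football scoring rates.
    """
    home = [poisson_pmf(k, home_mean) for k in range(max_goals + 1)]
    away = [poisson_pmf(k, away_mean) for k in range(max_goals + 1)]
    home_win = draw = away_win = 0.0
    for h, p_home in enumerate(home):
        for a, p_away in enumerate(away):
            p = p_home * p_away  # independence: joint probability is the product
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win


# Hypothetical means for a slightly stronger home side.
print(outcome_probabilities(1.5, 1.1))
```

With means in this range the draw probability stays clearly below 30%, which is the limitation discussed in the next paragraph.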
However, the average values, which I approximated with a fitted curve, were calculated once and have not changed since, so present-day data were being used to calculate past rankings. Moreover, the Poisson distribution alone does not reflect the nature of the game very well. In particular, the likelihood of a draw was never more than 27%, which underestimates the real percentages that can reach 30% or 31%. I have been thinking for months about how to improve this without making the model too complicated.
I have now come up with an improved method that I am really happy with. It builds on the main strength of clubelo, its comparatively large database. Two major changes are implemented: the Poisson variables are now adaptive, and the set of past results is used to predict results.
The core of the rating and prediction system is still Elo. For every difference in Elo points, there are two parameters: average home goals and average away goals. Initially, I use a distribution in which the average number of away goals equals 1.6 divided by the average number of home goals (the constant was changed on 14/10/2013; before it was 2.0). The second constraint is that the result prediction from the combined Poisson distributions has to be equal to the prediction from the Elo system. This is done for every percentile and serves as a starting point.
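One possible way to find such starting values is sketched below. It rests on several assumptions that the text does not confirm: the Elo prediction is taken to be the standard expected score 1 / (1 + 10^(-d/400)) without home advantage, the Poisson "result prediction" is read as P(win) + 0.5 * P(draw), and the home mean is found by bisection. All names and search bounds are hypothetical.

```python
from math import exp, factorial

GOAL_PRODUCT = 1.6  # from the text: away mean = 1.6 / home mean


def elo_expected_score(elo_diff):
    """Standard Elo expected score (assumption: no home-advantage term)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def poisson_pmf(k, mean):
    return exp(-mean) * mean ** k / factorial(k)


def poisson_expected_score(home_mean, away_mean, max_goals=40):
    """P(home win) + 0.5 * P(draw) for two independent Poisson goal counts."""
    home = [poisson_pmf(k, home_mean) for k in range(max_goals + 1)]
    away = [poisson_pmf(k, away_mean) for k in range(max_goals + 1)]
    score = 0.0
    for h, p_home in enumerate(home):
        for a, p_away in enumerate(away):
            if h > a:
                score += p_home * p_away
            elif h == a:
                score += 0.5 * p_home * p_away
    return score


def initial_goal_means(elo_diff, lo=0.2, hi=8.0, iterations=60):
    """Bisect on the home mean until the Poisson score matches the Elo score.

    The expected score rises with the home mean (the away mean falls at the
    same time), so plain bisection is enough. The bounds are assumed and only
    cover Elo differences of realistic size.
    """
    target = elo_expected_score(elo_diff)
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if poisson_expected_score(mid, GOAL_PRODUCT / mid) < target:
            lo = mid
        else:
            hi = mid
    home_mean = (lo + hi) / 2.0
    return home_mean, GOAL_PRODUCT / home_mean


# Starting values for a hypothetical 200-point favourite.
print(initial_goal_means(200))
```

For an Elo difference of zero both means come out equal, at the square root of 1.6, because the two constraints are then symmetric.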
The outcome of every match influences the Poisson parameters: each new parameter for average goals consists of 99.9% of the old parameter and 0.1% of the new result. This way, the parameters will smoothly approach their real values and will also change over time. Below you can see the current values:
As we have seen, the Poisson distribution is not sufficient to predict football results accurately enough. It seems that clubs do settle unnaturally often for a draw. It is very hard to find a simple predictive model for this behaviour. I decided that the way forward is not to try to simulate what happens but to see what happened in the past. For each percentile there are hundreds, sometimes thousands of matches in the database. I assume that calculating the distribution of these results is the best way of predicting what will happen in the future. It is done in the following way:
We start with an empty 2-dimensional result table for each percentile. When a result occurs, every other value in the table is multiplied by 0.999 and then 0.001 is added to the cell that corresponds to that result. If there are many games for a percentile, the sum of this distribution will approach one, and recent games are weighted more heavily than old games. For every percentile, there is a difference between the sum of result occurrences and 1, sometimes larger and sometimes smaller. This remainder is filled up with the predictions from the Poisson distributions, which should minimise statistical noise if there are not enough games. As the nature of 2-leg games is different and not comparable to league or group matches, the results history method is not applied to second leg matches.
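The two exponential updates, for the goal averages and for the result table, can be sketched as follows. This is a minimal illustration, not the site's actual code: the table is stored as a dictionary keyed by (home goals, away goals), and the sketch assumes that the factor 0.999 is applied to all cells, including the one that is then incremented, which keeps the sum at or below one. The function names are hypothetical.

```python
WEIGHT = 0.001  # from the text: 0.1% for the new result, 99.9% for the history


def update_mean(old_mean, goals, weight=WEIGHT):
    """Adaptive Poisson parameter: 99.9% old value plus 0.1% of the new result."""
    return (1.0 - weight) * old_mean + weight * goals


def update_table(table, home_goals, away_goals, weight=WEIGHT):
    """Decay every stored cell, then add the weight to the observed result.

    Assumption: the decay is applied to all cells, so after n matches the
    table sums to 1 - 0.999**n.
    """
    for score in table:
        table[score] *= 1.0 - weight
    key = (home_goals, away_goals)
    table[key] = table.get(key, 0.0) + weight


def blended_probability(table, poisson_probs, score):
    """Result-history share plus the remainder filled from the Poisson model.

    poisson_probs maps (home goals, away goals) to the Poisson probability
    of that exact result.
    """
    rest = 1.0 - sum(table.values())
    return table.get(score, 0.0) + rest * poisson_probs.get(score, 0.0)


# A percentile with only 1000 recorded matches is still far from a sum of one.
table = {}
for _ in range(1000):
    update_table(table, 1, 1)
print(sum(table.values()))
```

After 1000 matches the table sums to roughly 0.63, so more than a third of each prediction for that percentile would still come from the Poisson distributions.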