Recently I read this blog post which looks at applying a hierarchical Bayesian model to understand the uncertainty of the attacking and defensive strengths of different teams in the Six Nations. It is a replication of another blog post which applies the model to data from the English Premier League. The models used by both blogs are based on a paper outlining two such models and applying them to Serie A, the premier division of Italian football.
The league portion of the 15/16 Guinness Pro 12 is now finished. I thought it might be fun to apply the same technique to this data, especially since there are playoffs at the end of the competition, consisting of two semi-finals and a final, which could be interesting to simulate. Unfortunately I didn't get this post up before the semi-finals last weekend, but better late than never! Unlike the blog posts above, which use Python, I have done everything in R, using JAGS to perform Markov chain Monte Carlo.
Getting the Data
I expected a good bit of wrangling to get the data I needed, but it turned out to be pretty straightforward. On the Pro 12 website I went to the Fixtures and Results section. All the data I needed was there, but I haven't scraped much data from the web in my time. I took a punt, googled something like "R html table", and found the readHTMLTable function, which is part of the XML package. So I pointed this at the Pro 12 URL in my browser and hoped for the best!
It seemed to return a list containing a NULL object and two dataframes. The dataframes were both quite similar but I just picked one and then went about tidying it up. It’s all shown below.
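Roughly, that step looks like the sketch below. Here pro12.url stands in for the Fixtures and Results URL from the browser, and which of the two data frames to keep is an assumption:

```r
library(XML)
library(data.table)

# pro12.url is a placeholder for the Fixtures and Results URL
tables <- readHTMLTable(pro12.url, stringsAsFactors = FALSE)

# The list contains a NULL element and two similar data frames;
# keep one of them and tidy it up
dt.results <- as.data.table(tables[[2]])
```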
After this we have the following data set consisting of the home and away team names and the score.
Next I associated a team id with each team and calculated the total points each scored and conceded in the season.
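A sketch of how this might look; the dt.results column names (home.team, away.team, home.score, away.score) are assumptions:

```r
# Totals from the home and away fixtures separately
home.pts <- dt.results[, .(pts.for = sum(home.score),
                           pts.against = sum(away.score)),
                       by = .(team = home.team)]
away.pts <- dt.results[, .(pts.for = sum(away.score),
                           pts.against = sum(home.score)),
                       by = .(team = away.team)]

# Combine them into one row per team and assign team ids
dt.teams <- rbind(home.pts, away.pts)[, .(pts.for = sum(pts.for),
                                          pts.against = sum(pts.against)),
                                      by = team]
setorder(dt.teams, team)
dt.teams[, team.id := .I]

# Attach the ids back to the results for use in the model later
dt.results[dt.teams, home.id := i.team.id, on = c(home.team = "team")]
dt.results[dt.teams, away.id := i.team.id, on = c(away.team = "team")]
```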
Now the results table and the teams table look like this
A few spot checks of the pts.for and pts.against columns against the final league table on the Pro12 site confirm all is well. The totals of the two columns should also be equal, since every point scored by one team is conceded by another.
A scatter plot gives a simple overview of each team's attacking and defensive strength.
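Something like the following, assuming the dt.teams table sketched above:

```r
library(ggplot2)

# Points scored vs points conceded per team over the season
ggplot(dt.teams, aes(x = pts.for, y = pts.against)) +
  geom_point() +
  geom_text(aes(label = team), vjust = -0.8, size = 3) +
  labs(x = "Points For", y = "Points Against")
```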
The goal of the remaining analysis is to use the individual game data to understand the uncertainty of these attributes.
Modelling
The model is well described in the links above but I will describe it here as well for completeness. The idea is that each team \(i\) has an underlying attack and defense parameter - \(att_i\) and \(def_i\).
It is assumed that the number of points a team scores in a game is a random variable following a Poisson distribution, with a rate parameter derived from the attacking and defensive strengths of both teams. Let \(p^h_{ij}\) be the number of points scored by the home team in a match between teams \(i\) and \(j\), and let \(p^a_{ij}\) be the number of points scored by the away team. Then

\[p^h_{ij} \sim \textrm{Poisson}(\theta^h_{ij}) \qquad p^a_{ij} \sim \textrm{Poisson}(\theta^a_{ij})\]

\[\log(\theta^h_{ij}) = intercept + home\_adv + att_i + def_j\]

\[\log(\theta^a_{ij}) = intercept + att_j + def_i\]
Here \(intercept\) accounts for the number of points you would expect teams to score in a game on average, while \(home\_adv\) accounts for the extra points you might expect the home team to score. Note that the intercept term isn’t included in the paper looking at Serie A.
So when teams \(i\) and \(j\) play, the model calculates the log of the scoring rate for each team by taking the attacking strength of that team and adding the defensive strength of the opposing team. If the opposing team's defense parameter is negative, the scoring rate drops and the attacking team is likely to score fewer points. Note that \(\theta^h_{ij}\) and \(\theta^a_{ij}\) will always be positive, since the inverse of \(\log\) is \(\exp\), which only returns positive numbers.
To make the model identifiable, the attack and defense parameters are constrained so that the sum of all the attack parameters is 0 and the sum of all the defense parameters is 0. This makes sense to me since we are only interested in the differences between parameters: we could add an arbitrary constant \(x\) to every \(att\) parameter and subtract it from every \(def\) parameter and obtain the same likelihood. So it makes things easier for the model to use a constraint, i.e.
\[\sum_i att_i = 0\]
\[\sum_i def_i = 0\]
Finally, prior distributions are specified for each of the parameters. \(att\) and \(def\) have Normal priors centred at 0, with precision hyperparameters \(\tau_{att}\) and \(\tau_{def}\). These precisions have Gamma prior distributions since they must always be positive:

\[att_i \sim \textrm{Normal}(0, \tau_{att}) \qquad def_i \sim \textrm{Normal}(0, \tau_{def})\]

\[\tau_{att} \sim \textrm{Gamma}(0.1, 0.1) \qquad \tau_{def} \sim \textrm{Gamma}(0.1, 0.1)\]

The \(intercept\) and \(home\_adv\) parameters get vague Normal priors. A vague (or flat) prior is one that is generally uninformative, though all priors hold some sort of opinion.
Note that the second parameter of the normal distribution is the precision rather than the standard deviation.
Coding the model
Here’s what the model looks like in the .bug file.
```
model {
  # Hyperprior for tau_attack and tau_defense
  tau_attack ~ dgamma(0.1, 0.1)
  tau_defense ~ dgamma(0.1, 0.1)

  # Define the priors of the attack and defense parameters.
  # N is the number of teams.
  for (i in 1:N) {
    att[i] ~ dnorm(0, tau_attack)
    def[i] ~ dnorm(0, tau_defense)
  }

  # Sum to zero constraint. These are the variables that will be used
  # to calculate the scoring rate.
  for (i in 1:N) {
    att_centered[i] <- att[i] - mean(att)
    def_centered[i] <- def[i] - mean(def)
  }

  # Now for each game specify the scoring rate for the home and away team.
  # G is the number of games. home_team and away_team give the team id of
  # the home and away teams for a given game.
  for (i in 1:G) {
    log(theta_home[i]) <- intercept + home_adv +
                          att_centered[home_team[i]] +
                          def_centered[away_team[i]]
    log(theta_away[i]) <- intercept +
                          att_centered[away_team[i]] +
                          def_centered[home_team[i]]

    # This links the observed scores to the theta parameters
    obs_points_home[i] ~ dpois(theta_home[i])
    obs_points_away[i] ~ dpois(theta_away[i])
  }

  # Priors for home advantage and intercept
  intercept ~ dnorm(0, 0.0001)
  home_adv ~ dnorm(0, 0.0001)
}
```
The following runs the model
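A sketch using the rjags package; the file name pro12.bug, the burn-in length, and the dt.results column names are assumptions:

```r
library(rjags)

# Data list matching the names used in the .bug file
jags.data <- list(
  N = nrow(dt.teams),
  G = nrow(dt.results),
  home_team = dt.results$home.id,
  away_team = dt.results$away.id,
  obs_points_home = dt.results$home.score,
  obs_points_away = dt.results$away.score
)

# Three chains so that Gelman and Rubin's shrink factor can be computed
jags.fit <- jags.model("pro12.bug", data = jags.data, n.chains = 3)
update(jags.fit, 10000)  # burn-in

# 150,000 iterations with thinning at 10, as discussed below
mcmc.samples <- coda.samples(
  jags.fit,
  variable.names = c("att_centered", "def_centered",
                     "intercept", "home_adv"),
  n.iter = 150000, thin = 10
)

# Full summary of the first chain, shown below
summary(mcmc.samples[[1]])
```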
I originally ran 20,000 iterations with no thinning, but found that the autocorrelation in intercept and home_adv was too high. So I ended up running 150,000 iterations with thinning at 10, and this seemed to do the trick. I ran three chains so that it was possible to create Gelman and Rubin's shrink factor plot. Here's a full summary of the first chain:
Not very informative! We’ll graph most of this later on to make it clearer.
Diagnostics
When learning about all of this I was most interested in figuring out whether the final model was sound. Doing Bayesian Data Analysis by John Kruschke provides code to create diagnostics by combining different plots from the coda package (the underlying coda calls are sketched after the list below). The things we should be most concerned about are:
A representative sample of the posterior distribution
Accurate estimates of the parameters
Efficient generation of the sample
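For reference, the individual coda functions behind such a diagnostic panel can be called directly on the samples, here for the intercept:

```r
library(coda)

# Trace and density plots for each chain
plot(mcmc.samples[, "intercept"])

# Autocorrelation at different lags, per chain
autocorr.plot(mcmc.samples[, "intercept"])

# Gelman and Rubin's shrink factor
gelman.plot(mcmc.samples[, "intercept"])

# Effective sample size across the chains
effectiveSize(mcmc.samples[, "intercept"])
```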
Here is the diagnostic plot for the intercept parameter:
The top right 'Param. Value' plot shows that the three chains have all formed stationary, overlapping time series. This is an intuitive indication that they have converged, although it doesn't guarantee it, since all the chains could be in completely the wrong place. However, this is unlikely, and it is looking good that the samples are representative.
The autocorrelation plot shows autocorrelation at different lags for each chain. There is very little, which indicates that the Gibbs sampler has done a good job of exploring the parameter space. This is another good indication that the samples are representative. Since there is no autocorrelation, the effective sample size is very close to the actual sample size. It is worth noting that there was a lot of autocorrelation in the intercept when I ran the same model with 20,000 iterations and no thinning. Thinning was added to remove the autocorrelation, at the cost of less efficient sampling.
The shrink factor plot on the bottom left is probably the most involved, so I will keep it brief. It compares the within-chain and between-chain variance. See ?gelman.diag in R for more details! Apparently a good heuristic is that there is trouble if the value is greater than 1.1.
The final plot in the lower right-hand corner shows density plots of each chain. They overlap, which again indicates that the chains converged. The Monte Carlo standard error (MCSE) is the standard error of the mean of the parameter, taking the effective sample size into account. We can see that the uncertainty in the intercept parameter is small, so we can be happy that we understand the accuracy of the estimate.
Here is the home_adv diagnostic plot:
Again everything looks in order, and we can see that we expect the mean value of home_adv to be around 0.268, with 0 well outside the HDI. Since we have

\[\theta^h_{ij} = e^{intercept + home\_adv + att_i + def_j} = e^{home\_adv} \cdot e^{intercept + att_i + def_j}\]

home advantage multiplies the team's scoring rate by a factor of \(e^{0.268} \simeq 1.307347\). So we expect that the home team will score roughly 31% more points on average than they would if the match were being played at the other team's ground. Exchanging home advantage can lead to a very different game!
The other 26 diagnostic plots look good, so I won’t include them here.
Comparing Teams
A nice plot generated in both the blog posts and the article mentioned at the beginning is one of attacking parameters vs defensive parameters for each team. It required a good bit of wrangling to get a data.table in the correct shape.
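A sketch of that reshaping; the use of HPDinterval and the exact column names are assumptions:

```r
library(coda)
library(data.table)

# Stack the three chains into one matrix of samples
samples.matrix <- as.matrix(mcmc.samples)

# One row per team: posterior means and 95% HDIs for attack and defense
dt.param.team <- rbindlist(lapply(dt.teams$team.id, function(i) {
  att <- samples.matrix[, sprintf("att_centered[%d]", i)]
  def <- samples.matrix[, sprintf("def_centered[%d]", i)]
  att.hdi <- HPDinterval(as.mcmc(att), prob = 0.95)
  def.hdi <- HPDinterval(as.mcmc(def), prob = 0.95)
  data.table(
    team      = dt.teams[team.id == i, team],
    att.mean  = mean(att),
    att.lower = att.hdi[1, "lower"], att.upper = att.hdi[1, "upper"],
    def.mean  = mean(def),
    def.lower = def.hdi[1, "lower"], def.upper = def.hdi[1, "upper"]
  )
}))
```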
After all of that dt.param.team looks like this (pity about the wrapping)
We can create the plot now!
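Along these lines, assuming the dt.param.team columns from the sketch above:

```r
# Posterior means with 95% HDI bars for each team
ggplot(dt.param.team, aes(x = att.mean, y = def.mean)) +
  geom_errorbar(aes(ymin = def.lower, ymax = def.upper), alpha = 0.3) +
  geom_errorbarh(aes(xmin = att.lower, xmax = att.upper), alpha = 0.3) +
  geom_point() +
  geom_text(aes(label = team), vjust = -0.8, size = 3) +
  labs(x = "Attack", y = "Defense")
```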
A notable observation is that this plot looks very similar to the plot of points for vs points against that we created at the beginning.
This is probably no surprise but what have we gained? Well a 95% HDI for both the attacking and defensive parameter of each team is also included. So now, in addition to the expected means, we also understand the uncertainty of the underlying attacking and defensive strengths of the teams.
Comparing the two plots is also valuable from a validation point of view - if they looked different it might indicate something had gone wrong!
Simulation
With an understanding of the uncertainty we can perform simulations and figure out who is likely to win the Pro 12 Grand Final. Of course, the only information the model has is the league results preceding the semi-finals and final. There are other factors that might be included in a more complex model, e.g. injuries to key players.
When using the output of MCMC to simulate, it is important to understand that the parameter values within a single sample of a chain are only valid when considered together. For instance, here are two different samples from the first chain overlaid on the plot above, one as red dots and the other as green.
Obviously it is not always clear which dots belong to which team, but the point is to observe how the shape of the overall landscape can change. The parameters within each sample are bound together, so when we sample from the distributions we must respect this: never mix values from different samples of the chain.
My code for simulating the playoffs is below. A couple of things to note are:
If there was a draw I just let the teams play again, since the model has no concept of extra time etc.
The Grand Final is at a neutral venue which the model also doesn’t understand. So I played the game twice with each team playing at home once. I then took the aggregate score. I thought about setting the home_adv parameter to zero but I’d worry this would upset the balance of the model.
The SimOnePlayoff function plays one iteration of the semi-finals and the final. Its top.four argument specifies the team.id of the teams that finished in the top four, in order. Note that the sample s is set at the start of this function and passed to the SimOneGame function when required.
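A sketch of what those functions might look like; the SimSemi helper, the replay loop for level aggregate scores, and the column-naming convention are assumptions of this sketch:

```r
# Simulate one game's scores from a single posterior sample s
SimOneGame <- function(s, home.id, away.id) {
  theta.home <- exp(s["intercept"] + s["home_adv"] +
                    s[sprintf("att_centered[%d]", home.id)] +
                    s[sprintf("def_centered[%d]", away.id)])
  theta.away <- exp(s["intercept"] +
                    s[sprintf("att_centered[%d]", away.id)] +
                    s[sprintf("def_centered[%d]", home.id)])
  c(home = rpois(1, theta.home), away = rpois(1, theta.away))
}

# A semi-final must produce a winner; replay draws since the model
# has no concept of extra time
SimSemi <- function(s, home.id, away.id) {
  repeat {
    score <- SimOneGame(s, home.id, away.id)
    if (score[["home"]] != score[["away"]]) break
  }
  if (score[["home"]] > score[["away"]]) home.id else away.id
}

SimOnePlayoff <- function(top.four) {
  # One posterior sample is used for every game in this playoff so
  # the parameter values stay bound together
  s <- samples.matrix[sample(nrow(samples.matrix), 1), ]
  semi.one <- SimSemi(s, top.four[1], top.four[4])
  semi.two <- SimSemi(s, top.four[2], top.four[3])

  # Neutral-venue final: play twice, swapping home advantage, and take
  # the aggregate score, replaying both legs if the aggregate is level
  repeat {
    leg.one <- SimOneGame(s, semi.one, semi.two)
    leg.two <- SimOneGame(s, semi.two, semi.one)
    agg.one <- leg.one[["home"]] + leg.two[["away"]]
    agg.two <- leg.one[["away"]] + leg.two[["home"]]
    if (agg.one != agg.two) break
  }
  champ <- if (agg.one > agg.two) semi.one else semi.two

  # Return the path of the result, e.g. "10-1-1"
  paste(semi.one, semi.two, champ, sep = "-")
}
```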
Now to run some simulations based on the top four of the league:
Leinster Rugby
Connacht Rugby
Glasgow Warriors
Ulster Rugby
Note that the competition rules say that the top 2 get home advantage. This means that the semi-finals are
Leinster Rugby vs Ulster Rugby (i.e. 1 vs 4)
Connacht Rugby vs Glasgow Warriors (i.e. 2 vs 3)
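A sketch of the run, reusing SimOnePlayoff from above. Leinster is team.id 10 and Connacht is team.id 1; the ids used for Glasgow and Ulster here are placeholders:

```r
set.seed(1)  # for reproducibility

# team.ids in league order: Leinster, Connacht, Glasgow, Ulster
top.four <- c(10, 1, 8, 11)
paths <- replicate(5000, SimOnePlayoff(top.four))

# Share of the 5,000 simulations won by each team
champs <- sapply(strsplit(paths, "-"), `[`, 3)
sort(table(champs) / length(paths), decreasing = TRUE)
```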
So, given just the results from the league and the final configuration of the top four, the model says there is a 55% chance that Leinster will win the Pro 12. We talked above about home advantage being very important, and given that the top two teams in the league get home semi-finals, it might be interesting to see how things would change if the top-four configuration were different. For instance, what if Connacht and Glasgow swapped positions, i.e.
Leinster Rugby
Glasgow Warriors
Connacht Rugby
Ulster Rugby
Then we get the following
Things change quite dramatically! Home advantage is crucial.
The final
In the intro I mentioned that I didn’t get this post online before the semi-finals. One benefit of this is that we know that Leinster beat Ulster and Connacht beat Glasgow. The SimOnePlayoff function above returns the path of the final result. So for example 10-1-1 indicates team.id 10 won the first semi-final, team.id 1 won the second, and then team.id 1 ultimately won the Grand Final against team.id 10. Leinster have team.id 10 and Connacht have team.id 1, so let’s look at the distribution of the paths.
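With the paths vector from the run above:

```r
# Relative frequency of each playoff path across the simulations
head(sort(table(paths) / length(paths), decreasing = TRUE), 10)
```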
We can see that a Leinster Connacht Grand Final was always the most likely outcome accounting for around 50% of the finals in the 5,000 simulations. Of these finals, Leinster won roughly 4 of every 5.
Personally, I think the final will be much closer than this analysis suggests. It is important to note that the model doesn't have any concept of time: results at the start of the season are given the same importance as results at the end, so the parameters being estimated are an average across the season. Given the volatile nature of sport (injuries etc.) this is not the most prudent approach to understanding the uncertainty in match results. However, it is a fun exercise nonetheless!