analytic.football - New Model, New Explainer

Before we get started, you can view the model at analytic.football.

Classically, as soon as I had finally published the previous explainer, I went down a rabbit hole of improving the model, rendering it almost immediately out of date. In the last couple of months the model has gone through some pretty hefty upgrades, enough to warrant a new blog.

The basic idea is unchanged:

  1. use team ratings to predict a match

  2. compare the prediction to what actually happened

  3. update the ratings based on the error
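In stripped-down Python, that loop looks something like this (illustrative numbers and a much-simplified update rule, not the model’s actual code):

```python
import math

def predict_xg(attack, defence, league_rate=1.35):
    """Expected non-penalty xG for one side: a league rate scaled by two ratings."""
    return league_rate * attack * defence

def update(rating, predicted, actual, speed=0.05):
    """Nudge a rating using the (log-ratio) error between prediction and outcome."""
    return rating * math.exp(speed * math.log(actual / predicted))

home_attack, away_defence = 1.20, 0.95
pred = predict_xg(home_attack, away_defence)      # 1. predict the match
actual = 2.1                                      # 2. what actually happened
home_attack = update(home_attack, pred, actual)   # 3. update ratings from the error
```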

Likewise, the inputs are the same, still non-penalty xG, penalties, goals and red cards. And the same adjustments to a match occur, downweighting penalties and applying a correction for red card effects.

What’s changed is a lot of the internal plumbing: the previous version had quite a few “good enough for now” hacks, and this version aims to tidy everything up with a bunch more statistical rigour. So let’s go through the changes.

Welcome League Environment Variables

In the previous version of the model, the team ratings are absolute, and the league-wide scoring rate, penalty rate and such are inferred from the team ratings themselves (i.e. they just fall out of the average). It means that the ratings are being asked to do two things at once:

  • Represent how good this team is compared to the league average

  • Represent the league-wide scoring environment

In v2, league-wide stuff is tracked explicitly as global baselines that evolve over time:

  • Global non-penalty xG rate (“how much non-penalty xG does a typical match have right now?”)

  • Global penalty xG rate (“how many penalties are happening?”)

  • Global finishing rate (“is the league converting chances above/below normal?”)

Team ratings become more purely relative again (think “multipliers around 1”), while the globals carry the absolute scale. This cleanly lets the team parameters do their job of rating teams relative to each other, without the need for workarounds such as the constant.
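A stripped-down sketch of that separation (the baseline numbers and speeds here are illustrative, not the tuned values):

```python
league = {"np_xg_rate": 1.30, "pen_xg_rate": 0.08, "finishing": 1.00}  # global baselines

def update_global(current, observed, speed=0.01):
    """Let a league-wide baseline drift slowly towards what matches are showing."""
    return current * (observed / current) ** speed

def expected_np_xg(attack_mult, defence_mult):
    """Ratings are multipliers around 1; the global baseline carries the absolute scale."""
    return league["np_xg_rate"] * attack_mult * defence_mult

# A strong attack (1.25) against a defence that concedes 10% more than average (1.10):
print(expected_np_xg(1.25, 1.10))   # ≈ 1.79 non-penalty xG
```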

Goodbye Constant

The previous version of the model had a small constant term in the ratings. It was necessary because, without it, the “penalty proportion” was handling both the signal about future performance that penalties carry and their contribution to the league-wide total (i.e. down-weighting penalties would also down-weight the league-wide goals predictions).

But it was always a bit of a hack. Since team ratings are now relative, and the league-wide effect is handled explicitly, there is no need for the constant any more: the “penalty proportion” is only being asked to answer the question “how good do we think this team is, relative to other teams?”

Two Signals Are Better Than One

Ok so we’ve cleaned up some of the nasties from v1, but now we’re into the meat of the updates. The chance creation module has changed throughout. In the previous version of the model, the short-term rating updated like this:

  • Predict using the short rating

  • Take the error from that prediction

  • Blend it with previous prediction errors (in defence)

  • Update short with the blended signal

  • Update long from the short rating

The basic idea of blending the previous errors is to smooth the rating input: the model doesn’t have to bounce around with match-to-match performance fluctuations just to move quickly enough when a team’s underlying level truly shifts.

However, once again, this is a bit of a hack. The major issue is that previous match errors don’t provide an accurate signal of a team’s level on their own, since each was an error relative to the rating at the time. This mechanism meant the model had a tendency to overshoot and then rebound back to the correct level after a team’s level changed. And indeed, this is perhaps why the model didn’t want to use this mechanism in attack previously.

In version 2, we directly track the recent level of the team. We maintain exponential moving averages of xG for and against, and of what we would expect a league-average team to generate in the same matches (i.e. taking into account opposition strength, home advantage and red cards).
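Per team (and per side of the ball), that’s roughly the following; the smoothing constant and starting values are illustrative:

```python
def ema(previous, value, alpha=0.10):
    """Exponential moving average update."""
    return (1 - alpha) * previous + alpha * value

team_xg_ema, baseline_xg_ema = 1.30, 1.30   # start both at a neutral level

match_np_xg = 1.9          # what the team actually produced
avg_team_expected = 1.2    # what a league-average team would have been expected to
                           # produce in that fixture (opposition, venue, red cards)

team_xg_ema = ema(team_xg_ema, match_np_xg)
baseline_xg_ema = ema(baseline_xg_ema, avg_team_expected)

measured_level = team_xg_ema / baseline_xg_ema   # > 1 means above league average
```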

In the new version, there is no ad hoc blending of these two signals; instead they feed into two separate updates (sketched in code after this list):

  • Blended rating makes a prediction

  • Short updates from prediction error (and decays towards long)

  • Long updates from exponential moving average (and decays to league average)

  • Long & short are blended for team rating

Let’s go through some of the changes.

Long Rating Now Directly Tracks Team Level

In both models the long rating is meant to act as a slower moving, more sober view of team quality. In v1, the long rating is essentially a smoothed version of the short-term rating, meaning it could only learn from what the short one had already digested. When a team changes level, first the information has to feed into the short rating, and only then does the long rating start to follow.

The short-term speed is therefore forced to do two jobs at once: it’s trying to be the best “next match” predictor, and it’s acting as the messenger that delivers new information to long-term.

In v2, long-term updates directly toward a rolling, schedule-adjusted view of team performance. The match goes straight into the moving averages (what happened, and what a league-average team would have been expected to do in that same fixture), and long-term can decide how quickly to respond to that signal.

Put another way: v1 was “mix surprise with previous surprises, short reacts to the mix, long tracks the history of short”. v2 is “short reacts to surprises, long learns from measured level”. It’s a cleaner separation of jobs, and it removes some accidental delays in how information flows through the model.

Long Ratings Mean Revert

I’ve mentioned in a few places that the long rating decays towards the league average in v2, something v1 didn’t do. Why? It seems barmy: some football teams have bigger budgets, and success begets money begets success. So it seems like I’m deliberately making the model worse; if a team really is elite for years, why would I want to pull their rating away from that?

The answer is that long-term should behave like a prior, not a history. Squads churn, players age, managers and owners change, promoted teams get themselves established. The evidence we have from before loses value over time; it goes out of date. This architecture demands that the model keeps getting new evidence to maintain a prior away from the league average.

But you’re right: in a steady state the long-term rating will remain static at a level below the team’s quality. This isn’t particularly a problem, since the long rating isn’t meant to be an independent view of the team’s quality; it’s meant to be one component of the rating. But it does necessitate another architectural change.

Short Ratings Use Blended Prediction

In v1, the short component sat as an independent predictor. It made its own xG prediction, got its own error, and updated itself, while the blended rating incorporated long as a historical ballast underneath.

In v2, the short rating still reacts to match errors, but those come from the model’s final prediction - i.e. using the blended ratings.

We know that in a steady state the long rating is deliberately conservative, slightly pulled in towards the league average. While this is how a prior should behave, we need a way for the short rating to carry the extra deviation in a steady state, to avoid systematically over-predicting parity.

Let’s imagine a scenario where the short rating is accurately describing a team at 2.0, and the long is sitting steadily at 1.9 because of the inward pull. Assuming a 50/50 blending the model’s rating will sit at 1.95, below the team’s true level.

If we used the “short only” architecture from v1:

  • Prediction from short only: 2.0, outcome: 2.0, short doesn’t move.

  • Long rating is pulled towards 2.0 and towards the league average. Remains at 1.9.

  • The blended rating remains unchanged at 1.95, despite the team performing better than the model’s blended rating.

The model would systematically under-predict good teams and over-predict bad teams in the long term.

Using the blended ratings for predictions:

  • Prediction from blend: 1.95, outcome: 2.0. Residual is positive, so short is pushed up.

  • Long rating is pulled towards 2.0 and towards the league average. Remains at 1.9.

  • The blended rating moves up towards the team’s true level (2.0). Eventually short will settle at the value which leaves the blended rating at 2.0.

Here the short rating is getting positive signal, and will move to compensate for any bias in the long rating. In the steady state the blended rating will reach 2.0, the correct level for the team. It sounds a bit fiddly, but this stops the components fighting each other.
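If you want to watch it converge, here is that toy scenario in a few lines of Python, using a deliberately simplified additive 50/50 blend and update rather than the real rule:

```python
short, long, true_level, speed = 2.0, 1.9, 2.0, 0.2

for matchday in range(100):
    blended = 0.5 * short + 0.5 * long        # the model's prediction
    short += speed * (true_level - blended)   # short absorbs the residual
    # long stays pinned at 1.9 by its pull towards the league average

print(round(short, 3), round(0.5 * short + 0.5 * long, 3))   # 2.1 2.0
```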

Short Moves Towards Long

This is a bit of a flip from v1, where long moved towards the short rating. But it’s a necessary part of giving the model’s pieces clear jobs.

The short rating is the model’s “react immediately” piece. When a team suddenly improves (new manager, new signing, Daniel Farke’s magical half-time-at-City fairy dust), the match-to-match errors will start shouting “we’re under-predicting them”, and the short rating moves accordingly. That’s exactly what short is for: it jumps first.

But the model also has a slower, more stable way of learning the new level: the moving average / measurement signal. After a few matches, the EMA starts to agree that “yes, this team has genuinely improved, it wasn’t just a flash in the pan”. At that point, the long rating begins to move as well.

If short didn’t decay back toward long, both components would keep carrying the same improvement at the same time. Short jumps up early, then long starts rising underneath it, but short never hands any of that weight over, so the blended rating ends up too high. You then get a classic pattern: overshoot, then a correction, then another overshoot the next time the signal moves. Bouncy ratings and worse predictions.

The reversion towards the long is basically the handover mechanism. Short is allowed to react first, but as the measurement signal is absorbed into long-term, short is pulled back unless the match errors keep insisting it needs to stay elevated.

Between Season Uncertainty

While we want to maintain continuous ratings, there is significantly more uncertainty about a team’s level early in the season. Manager changes and squad turnover over the summer mean that we should hold weaker priors about a team going into a new season.

In addition, we know that we have more uncertainty about the three promoted clubs than the 17 that remained in the league. Not only is there league-translation uncertainty, but these clubs are also likely to have the highest turnover of players as they prepare for life in the PL.

The moving averages provide a neat way to handle this uncertainty. For each team we keep a decayed running total of xG and of schedule difficulty; these generate the moving-average ratios that the ratings update towards.

The old model did account for promoted clubs in an ad hoc way - through the initial denominator on the running totals for match prediction errors. However, this was a guesstimated parameter, and we did nothing to handle the other 17.

The new version introduces a “season offset” which weakens the prior for the remaining 17 clubs. This means that it throws away some proportion of the old season data, and the moving average signal (and thus the long rating) will move quicker early season. I tune similar parameters for the 3 promoted clubs, ensuring they have weaker priors than the other 17.
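In code terms it looks something like this; the offset values here are placeholders rather than the tuned ones:

```python
SEASON_OFFSET_RETAINED = 0.30   # the 17 staying up keep ~70% of last season's evidence
SEASON_OFFSET_PROMOTED = 0.60   # promoted clubs keep less, so their priors are weaker

def apply_season_offset(xg_total, schedule_total, promoted):
    """Shrink the decayed running totals at the season break to weaken the prior."""
    keep = 1 - (SEASON_OFFSET_PROMOTED if promoted else SEASON_OFFSET_RETAINED)
    return xg_total * keep, schedule_total * keep
```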

More Sceptical On Finishing

The finishing module has smaller updates, but should in general behave a bit more responsibly. While the basic mechanics are the same:

  • Running totals for xG & goals update

  • Finishing rating moves towards goals/xG ratio

There are two new features that better reflect the scepticism we should have around finishing effects. The first is that instead of updating towards the goals/xG ratio it introduces a ballast, updating towards (goals + c) / (xG + c). This should reduce bounciness as the model will not treat the extreme finishing variance so naively.

In addition, the rating also decays towards 1. The model getting stuck on a team’s finishing level after a hot or cold spell was particularly prominent because of how slowly finishing responds. With explicit forgetting, teams need to keep demonstrating they are good finishers for the model to keep believing it, instead of the model simply waiting for the signal to slowly drop out of the moving average.
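Putting the two together, the finishing update looks roughly like this; the ballast, speed and decay values are illustrative, not the tuned ones:

```python
def update_finishing(rating, goals_total, xg_total, c=10.0, speed=0.05, forget=0.02):
    """Move towards a ballasted goals/xG ratio, and explicitly decay back towards 1."""
    target = (goals_total + c) / (xg_total + c)   # e.g. 30 goals from 25 xG -> 40/35 ≈ 1.14
    rating += speed * (target - rating)           # move towards the ballasted ratio
    rating += forget * (1.0 - rating)             # explicit forgetting towards neutral
    return rating
```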

Residuals Instead Of Linear Errors

In v1, the update signal is basically “actual minus predicted xG”. That’s intuitive, but it has a catch: the same 0.5 xG miss doesn’t mean the same thing in every match.

In v2, the update signal uses a Gamma deviance residual instead. This both reduces the impact of high-volume games, and provides some weight to the ratio of the miss as well as the difference (e.g. 1.0 vs 0.5 is a bigger miss than 2.5 vs 3.0).

More technically, it answers the question “how surprising was this outcome given what we expected?”. This makes the update signal more comparable across different match types, allowing a better distinction between signal and noise.
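Concretely, the standard signed Gamma deviance residual looks like this (illustrative code, not the model’s exact implementation), using the example from above:

```python
import math

def gamma_deviance_residual(actual, predicted):
    """Signed square root of the Gamma unit deviance: 'how surprising was this?'
    (Real xG totals are rarely exactly zero; a small floor would be needed if they were.)"""
    dev = 2 * ((actual - predicted) / predicted - math.log(actual / predicted))
    return math.copysign(math.sqrt(dev), actual - predicted)

print(gamma_deviance_residual(1.0, 0.5))   # ≈ +0.78  (a big miss in relative terms)
print(gamma_deviance_residual(2.5, 3.0))   # ≈ -0.18  (the same 0.5, but much less surprising)
```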

Multiplicative Not Additive

This is more of a housekeeping change than a new feature, but it makes a small difference. In v2, everything lives in “multiplier land”, with blending (& updating) done geometrically.

This makes rebasing a lot simpler: in v1, raw rating values had to be adjusted after each match to keep the attack and defence averages aligned.

Now, since the ratings are dimensionless and league averages are handled globally, the model just works with harmonised ratings (centred on 1). The separation of jobs removes the small drift seen in v1.
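One way to see it: centring a set of multiplicative ratings amounts to dividing by their geometric mean, something like the sketch below (an illustration of the idea, not the model’s code):

```python
import math

def harmonise(ratings):
    """Divide by the geometric mean so the ratings are centred on 1."""
    geo_mean = math.exp(sum(math.log(r) for r in ratings) / len(ratings))
    return [r / geo_mean for r in ratings]

print(harmonise([1.30, 1.00, 0.80]))   # the result's geometric mean is exactly 1
```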

Dixon–Coles Correction

This doesn’t affect the ratings themselves, but the predictions got an upgrade. Previously the model was using independent Poisson from the goals predictions to generate the W/D/L. It’s not a bad approximation, but it systematically underpredicts low-score draws. Dixon–Coles is a standard correction: it tweaks the probabilities of 0–0, 1–0, 0–1, 1–1 to better match real football’s “game state” effects.
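In code, the standard tau adjustment from Dixon and Coles (1997) looks like this, with an illustrative rho rather than the fitted value:

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def dixon_coles_tau(h, a, lam, mu, rho):
    """Adjustment factor for the four low-scoring scorelines; 1 everywhere else.
    The four tweaks cancel out, so the scoreline grid still sums to 1."""
    if (h, a) == (0, 0): return 1 - lam * mu * rho
    if (h, a) == (0, 1): return 1 + lam * rho
    if (h, a) == (1, 0): return 1 + mu * rho
    if (h, a) == (1, 1): return 1 - rho
    return 1.0

def scoreline_prob(h, a, lam, mu, rho=-0.10):
    """Independent Poisson scoreline probability, nudged by tau.
    A negative rho boosts 0-0 and 1-1 and trims 1-0 and 0-1."""
    return dixon_coles_tau(h, a, lam, mu, rho) * poisson_pmf(h, lam) * poisson_pmf(a, mu)

print(scoreline_prob(0, 0, 1.4, 1.1))   # a touch higher than the plain Poisson 0-0
```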

Aaaaand breathe.

What’s the upshot? A better model: more stable ratings, fewer cases where teams get stuck, and a consistent improvement in future match predictions (measured on out-of-sample matches using scoreline log-likelihood, with a Dixon–Coles adjustment for the low-score correction). A better model means a better table sort, and that’s what we all live for.