Where do shots come from?
Introduction
“Football is a simple game but playing simple football is the hardest thing there is” is a famous quote by Johan Cruyff. Modern football produces massive amounts of data, more than any human can comprehend. The challenge teams now face is how to make sense of that data in a way that gives them a competitive edge. People say the devil is in the detail, and more often than not it is the fine margins that make the difference between winning and losing. But before going into the fine detail, one needs to know where to look. Football is a game with endless possibilities, and one can very easily fall into the trap of analysing endlessly without reaching any concrete outcome (or at least not at the level of detail that is demanded). In such environments, tools that narrow the focus of analysis by first concentrating on the most probable answer to a given question may prove to have merit. Tools that trace the relationships between events can tell these stories in a time-efficient manner, and the time saved can be invested in analysing the finer details of a given sequence, something that could make the difference.
Association analysis will be used to determine the events most likely to lead up to a shot, and their locations. This provides a quick reference point, based on historical data, for the areas with the highest threat of leading to a shot. Teams can use it to review their own performance and concentrate on improving other areas, or on gradually introducing new approaches in order to stay unpredictable. It can also be beneficial when such analysis is available for opponents. Knowing where the threat is most likely to come from allows a team to prepare how to counteract the build-up to that area more effectively, how to control and restrain it, and how to set the defensive shape so that these areas are hard to access.
Association analysis
Association analysis is a method used to mine patterns in a high volume of events. It groups the events that happened together and then analyses which combinations occur together more often than not, relative to the total number of event sequences.
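The measures referred to later in the results (support, confidence, lift and conviction) can be illustrated with a minimal sketch. The function name `rule_metrics` and the toy sequences are hypothetical and are not taken from the dataset; the formulas are the standard definitions for a rule of the form antecedent -> consequent.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return support, confidence, lift and conviction for antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    # Count transactions containing the antecedent, the consequent, and both.
    n_a = sum(antecedent <= set(t) for t in transactions)
    n_c = sum(consequent <= set(t) for t in transactions)
    n_ac = sum((antecedent | consequent) <= set(t) for t in transactions)

    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    consequent_support = n_c / n
    lift = confidence / consequent_support if consequent_support else float("inf")
    # Conviction is unbounded when the rule never fails (confidence == 1).
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - consequent_support) / (1 - confidence)
    return {
        "support": support,
        "confidence": confidence,
        "lift": lift,
        "conviction": conviction,
    }


# Toy sequences, invented purely for illustration (not taken from the dataset).
toy = [
    {"Pass-Zone26", "Shot-Zone28"},
    {"Pass-Zone26", "Shot-Zone28"},
    {"Pass-Zone26", "Shot-Zone23"},
    {"Pass-Zone22", "Shot-Zone28"},
]
metrics = rule_metrics(toy, {"Pass-Zone26"}, {"Shot-Zone28"})
```

A lift or conviction above 1 suggests the consequent is more likely than usual when the antecedent is present, while values below 1 suggest the opposite.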
Methodology
The dataset used is the WSL 2020/2021 season data provided by StatsBomb's free data repository, and Chelsea WFC has been picked as the subject of the analysis. The data has been filtered to Chelsea WFC's games in order to produce a dataset of events that occurred only during those games. Several assumptions have been made to aid the analysis, namely:
1. The pitch has been split into 30 zones
Analysing the coordinates of each event independently would not provide any insights due to the huge variety of combinations of X and Y. Hence, each event has been labelled with the zone in which it occurred, as determined by the X and Y values of the event. Figure 1 outlines the zones into which the pitch was split. The pitch is split into 6 along its length, in order to break the opposition half down into thirds, and into 5 across its width, in order to include the half spaces which have become popular in modern football.
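As a rough sketch of the labelling step, the function below maps StatsBomb coordinates (a 120 by 80 pitch) to a zone number. Two things are assumptions rather than something confirmed by Figure 1: the zones are taken to be of equal size, and they are numbered so that each band of five along the length of the pitch is counted from the left touchline, which places zone 28 centrally in front of the opposition goal. The name `zone_for` is hypothetical.

```python
PITCH_LENGTH = 120.0  # StatsBomb x range
PITCH_WIDTH = 80.0    # StatsBomb y range
N_COLUMNS = 6         # splits along the length of the pitch
N_LANES = 5           # splits across the width of the pitch


def zone_for(x, y):
    """Map StatsBomb coordinates to a zone number from 1 to 30 (assumed numbering)."""
    column = int(x / (PITCH_LENGTH / N_COLUMNS))
    lane = int(y / (PITCH_WIDTH / N_LANES))
    # Clamp so events on the far goal line or touchline stay inside the grid.
    column = max(0, min(column, N_COLUMNS - 1))
    lane = max(0, min(lane, N_LANES - 1))
    return column * N_LANES + lane + 1


central_box_zone = zone_for(110, 40)
```

If the real zones are not equal in size (for example, if the half spaces are narrower than the wings), the two divisions would need to be replaced with explicit boundary lists.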

After all of the above has been done, the dataset is iterated through in order to combine each shot with its preceding events into a single itemset. Each itemset contains a combination of strings which describe the events in the following format:
{Event Type}-{Event Zone}
An example itemset of a shot would look like the following:
(Pass-Zone22) -> (Pass-Zone23) -> (Pass-Zone30) -> (Ball Recovery-Zone28) -> (Shot-Zone28)
2. Carry and Ball Receipt events have been filtered out from the dataset
Both have been deemed not to add value over using just the Pass event in the context of the analysis. Every Pass has a Ball Receipt linked to it, which is very rarely positioned in a different zone to the next event, making it redundant. Each Ball Receipt is in turn followed by a Carry, which represents the player's first touches of the ball if they remain in control and which likewise does not change the position of the next event by a considerable amount.
Before filter: Pass -> Ball receipt -> Carry -> Pass -> Ball receipt -> Carry -> Shot
After filter: Pass -> Pass -> Shot
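The filter can be sketched as follows. The event shape (a dictionary with a flat `type` key) and the function name `drop_redundant_events` are hypothetical. The name is normalised before comparison so that spellings such as "Ball Receipt*" or "Ball receipt" are treated alike.

```python
EXCLUDED_TYPES = {"carry", "ball receipt"}


def drop_redundant_events(events):
    """Keep every event except Carry and Ball Receipt, preserving order."""
    kept = []
    for event in events:
        # Normalise the name so "Ball Receipt*" and "Ball receipt" both match.
        name = event["type"].rstrip("*").strip().lower()
        if name not in EXCLUDED_TYPES:
            kept.append(event)
    return kept


raw_types = ["Pass", "Ball Receipt*", "Carry", "Pass", "Ball Receipt*", "Carry", "Shot"]
sequence = [{"type": t} for t in raw_types]
filtered_types = [e["type"] for e in drop_redundant_events(sequence)]
```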
3. The number of events considered before a Chelsea shot is 10
This number has been established on a trial and error basis. Using fewer events risks capturing too little of the team's sequence, whereas using too many can include events which are distant from the shot in time. It is important to note that the events considered before the shot are not only Chelsea's. Instead, the 10 preceding events are taken irrespective of team, and only then are the ones belonging to Chelsea selected. This protects against the scenario where the opposition has been in possession for a considerable amount of time before a shot, in which case earlier Chelsea events would not actually be linked to the shot.
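A minimal sketch of this windowing step is shown below. It assumes the events of a single match are already filtered, in chronological order, and carry hypothetical `team`, `type` and `zone` keys; whether the original analysis let the window cross halves, or treated earlier shots inside the window differently, is not stated, so the sketch makes no attempt to handle either.

```python
LOOKBACK = 10  # number of preceding events quoted in the text


def build_itemsets(events, team, lookback=LOOKBACK):
    """Return one ordered list of 'Type-ZoneN' labels per shot taken by `team`."""
    itemsets = []
    for i, event in enumerate(events):
        if event["type"] != "Shot" or event["team"] != team:
            continue
        # Take the previous `lookback` events regardless of team...
        window = events[max(0, i - lookback):i]
        # ...then keep only those that belong to the team being analysed.
        own = [e for e in window if e["team"] == team]
        itemsets.append([f"{e['type']}-Zone{e['zone']}" for e in own + [event]])
    return itemsets


# Invented events for illustration only.
events = [
    {"team": "Opponent", "type": "Pass", "zone": 5},
    {"team": "Chelsea WFC", "type": "Pass", "zone": 22},
    {"team": "Chelsea WFC", "type": "Pass", "zone": 23},
    {"team": "Opponent", "type": "Clearance", "zone": 3},
    {"team": "Chelsea WFC", "type": "Ball Recovery", "zone": 28},
    {"team": "Chelsea WFC", "type": "Shot", "zone": 28},
]
itemsets = build_itemsets(events, "Chelsea WFC")
```

Because the window is counted over both teams' events, a long spell of opposition possession naturally shortens the Chelsea part of the itemset.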
4. Only shots which meet a certain xG threshold are taken into account
StatsBomb have an xG metric which indicates how likely it is for a shot to result in a goal. That metric is used to keep only the shots which can be considered a threat on goal. Looking at the numbers, 25% of the goals that Chelsea have scored have an xG of less than 0.1, which is a considerable amount to leave out. Table 1 shows the percentage of goals scored within each xG range. Based on that, a decision has been made to use xG > 0.03 as the threshold for shots to be considered.
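The threshold check and the kind of breakdown behind Table 1 could look roughly like this. The keys `xg` and `is_goal` are hypothetical flattened fields (in the raw StatsBomb events the value is expected to sit under the shot's `statsbomb_xg` attribute), and the toy shots are invented, not Chelsea's real numbers.

```python
XG_THRESHOLD = 0.03  # threshold quoted in the text


def threatening_shots(shots, threshold=XG_THRESHOLD):
    """Keep only shots whose xG is strictly above the threshold."""
    return [s for s in shots if s["xg"] > threshold]


def share_of_goals_below(shots, xg_limit):
    """Fraction of goals whose xG was below `xg_limit`."""
    goals = [s for s in shots if s["is_goal"]]
    if not goals:
        return 0.0
    return sum(s["xg"] < xg_limit for s in goals) / len(goals)


# Invented shots for illustration only.
toy_shots = [
    {"xg": 0.02, "is_goal": False},
    {"xg": 0.03, "is_goal": False},
    {"xg": 0.05, "is_goal": True},
    {"xg": 0.08, "is_goal": False},
    {"xg": 0.25, "is_goal": True},
    {"xg": 0.40, "is_goal": True},
    {"xg": 0.60, "is_goal": True},
]
kept = threatening_shots(toy_shots)
low_xg_goal_share = share_of_goals_below(toy_shots, 0.1)
```

Running `share_of_goals_below` for several limits is one way to judge how many goals a candidate threshold would discard.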

Once all of the itemsets are generated, the transactions are one-hot encoded and prepared for analysis with the Apriori algorithm. In the context of the analysis, the results will be interpreted based on the following criteria:
1. Associations with a high support will be highlighted, as this indicates that certain event-zone combinations are frequently present together.
2. Associations with a high confidence will be considered even when the event-zone is not present often, because when it is present it leads to a certain event-zone most of the time.
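To make the encoding and mining steps concrete, here is a small pure-Python sketch of one-hot encoding and level-wise Apriori. It is an illustration of the technique, not the implementation used for the results, and the names `one_hot` and `apriori` as well as the toy transactions are hypothetical. A library implementation would typically consume the boolean table, whereas this sketch counts directly from the sets for brevity.

```python
from itertools import combinations


def one_hot(transactions):
    """Return the sorted item vocabulary and one boolean row per transaction."""
    items = sorted({item for t in transactions for item in t})
    rows = [[item in t for item in items] for t in transactions]
    return items, rows


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for itemsets meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    singles = {frozenset([i]) for t in transactions for i in t}
    current = {s: support(s) for s in singles}
    current = {s: v for s, v in current.items() if v >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Join frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c: support(c) for c in candidates}
        current = {c: v for c, v in current.items() if v >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Invented transactions for illustration only.
toy = [
    {"Pass-Zone26", "Shot-Zone28"},
    {"Pass-Zone26", "Pass-Zone21", "Shot-Zone28"},
    {"Ball Recovery-Zone28", "Shot-Zone28"},
    {"Pass-Zone22", "Shot-Zone23"},
]
items, rows = one_hot(toy)
frequent = apriori(toy, min_support=0.5)
```

Note that one-hot encoding discards both the order of events and any repeats within a sequence, so the mined associations describe which event-zones appear together rather than the exact path taken.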
Results
Before the Apriori algorithm is run, the data is checked to confirm that it makes sense. Once the itemsets are generated, they are summarised in Figure 2, which outlines the most common event-zone in each position before a shot was taken.

Figure 2 shows that the maximum number of events present in an itemset is 8. However, the event chain includes 8 events on only 2 occasions (see [0]) and 7 events on 24 occasions (see [1]), which already indicates that long build-ups before a shot are not a regular occurrence. On the other hand, the most frequent outcome of a sequence is a Shot in Zone 28, which is to be expected given that it is the most dangerous area of the pitch. From Figure 4 we can already start to get an idea of Chelsea's preferred way of getting a shot, with the most frequent events being a ball recovery in zone 28 and a pass from zone 26.


Figure 5 confirms what was observed earlier. Out of all shots, Pass-Zone26 -> Shot-Zone28 feature together 14.5% of the time, whilst Pass-Zone26 alone features 17.4% of the time. That corresponds to a confidence of roughly 83% (14.5 / 17.4), which is considerable given the variability of event and zone combinations. The relationship is further confirmed by the lift and conviction values. The same conclusion holds for Ball Recovery-Zone28 -> Shot-Zone28 and Pass-Zone27 -> Shot-Zone28. Both relationships feature in more than 10% of the total shots (12.3% and 11.7% respectively) and have a high confidence, which consequently leads to higher lift and conviction values. Further down, the relationship Pass-Zone16 -> Shot-Zone28 also has favourable values; however, it is present only 7.9% of the time and, with a confidence of less than 80%, it can be overlooked. A potential chain of events is formed by the relationships Pass-Zone21,Pass-Zone26 -> Shot-Zone28, Pass-Zone27,Pass-Zone23 -> Shot-Zone28 and Pass-Zone27,Pass-Zone22 -> Shot-Zone28. All of these have low support; however, when the antecedent events are present together they almost always lead to a shot in zone 28, as indicated by the high confidence.
What is interesting is that passes from zone 22, zone 23 and zone 24 have lift and conviction values lower than 1. That indicates that a shot from zone 28 is less likely than average when the ball has been passed from these zones. Given the high number of shots taken from zone 28 relative to other zones, it can be concluded that if a shot is taken from zone 28, these zones are more likely than not to be absent from the sequence.


Conclusion
In short, Chelsea like to attack from the left-hand side, most probably from a cross, and more often than not a shot is recorded from a ball recovery in front of goal after a pass. The causes behind this can be explored in greater detail. The analysis has fulfilled its purpose, which is to draw attention to the most probable threat of a shot on goal. This initial estimation can be used in a multitude of ways. Chelsea can use it to focus their attention on why the left-hand side works better than the right-hand side. It can be further broken down into passing patterns, player traits and player relationships in order to estimate why one thing has more success than another. It serves an even greater purpose for an opposition team which wants to know what the most probable threat on goal is. If presented with this initial estimation, the opposition team can then look into the build-up and control and restrain the ball as much as possible so as not to allow many chances in those areas. It can focus its attention on what Chelsea players do in those areas to create chances much sooner than with typical video analysis, and spend the time gained analysing player behaviour more granularly.
The above can serve as a use case of how pattern mining can be a tool for turning attention to the most probable answer to a given question. It does not replace other analytical methods but rather complements them. With such tools one can narrow the focus in a much more time-efficient manner by initially disregarding lower probabilities, and can subsequently spend the time saved breaking down player behaviour, patterns or anything else of interest in finer detail. That is something which could make the difference.