Soccermetrics: An Opening Salvo

{This is the first in what I think will be occasional discussions of trying to analyze soccer. There’s a lot here, and a lot churning in my head about it, but this piece tries to lay out some of the initial context, drawing on my own somewhat deep engagement with the early decades of sabermetrics. Like a lot of what I write, it feels too long for your usual internet content. Sorry about that.}

I’ve been here before.

I have fond memories of the old Baseball Abstracts, almost amateurishly published in softcover by Bill James in the 1980s. I remember laughing out loud at them: James never got enough credit for his humor. I remember the sense of helplessness as he showed that his beloved Frank White was indeed a better second baseman than my beloved Willie Randolph.

Most of all, I remember the sense of excitement that accompanied the growing understanding that new knowledge was being produced, that something was changing. Seen globally, trying to make sense of baseball is a small thing, but that constrained universe was undergoing a sea change.

Baseball, though, was lucky. It was lucky in a lot of ways, but a few stand out:

First, the game has always been a long succession of one-on-one confrontations. Yes, those batter versus pitcher moments only exist in the context of inning and score and runners on base and (perhaps most importantly for later analytic conundrums) defensive positioning. But the batter-pitcher relationship remains the window through which the game is most easily seen.

Second, baseball generated legitimate sample sizes without even trying: hundreds of at-bats, thousands of pitches; things that helped, just by their nature, to overwhelm popular misunderstandings of and downright resistance to things like standard deviation. It remains nearly impossible for a bad hitter to amass 200 hits in a season or a great hitter (when healthy) not to manage 100 over the course of 140 games. Just as importantly, 150+ games gives enough time for good teams to show they were good and bad teams to show they were bad. Teams over- and under-perform, of course, but it is very hard to be a bad team and finish with a .600 winning percentage. (For some related info, this recent piece at FiveThirtyEight talks about how remarkable it is for teams this year to be likely to outperform their predictions by about .050 in winning percentage.)
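A rough sketch of that sample-size point, for anyone who wants to poke at it. It assumes (and this is my simplification, not a claim about real baseball) that every game is an independent coin flip with a fixed win probability. The function names and the .450 "bad team" are made up for illustration.

```python
import math

def win_pct_sd(p, games):
    """Standard deviation of the observed winning percentage for a team
    whose true per-game win probability is p (ASSUMPTION: independent games)."""
    return math.sqrt(p * (1 - p) / games)

def prob_at_least(p, games, wins):
    """Exact binomial probability of winning at least `wins` of `games`."""
    return sum(
        math.comb(games, k) * p**k * (1 - p) ** (games - k)
        for k in range(wins, games + 1)
    )

if __name__ == "__main__":
    true_p = 0.450  # hypothetical "bad" team
    for games in (20, 150):
        # Spread of the observed winning percentage shrinks as games pile up
        print(games, "games: sd of winning pct =", round(win_pct_sd(true_p, games), 3))
    # Chance that the hypothetical .450 team finishes at .600 or better
    # over 150 games (90 wins or more)
    print("P(.600 or better over 150 games) =", prob_at_least(true_p, 150, 90))
```

Under those assumptions the spread in winning percentage over 150 games is about four hundredths, so a .450 team landing at .600 is several standard deviations out. Over 20 games the spread is far wider, which is the whole point about baseball's long season.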

Third, baseball, from the very beginning, counted a great many things and a great many of those were the right things. Yes, batting average turned out, after eighty years of worship, to be a false god. But at bats still mattered; hits still mattered; and there were relationships between the things that counted and the thing that kept score (runs).

All of that built a very solid platform. And once the box was open, we rushed in to grab whatever we could. At first, of course, much of it was the wild flailing of the newly converted. I know, let’s ADD slugging percentage to on base percentage. No, wait, let’s MULTIPLY them. Wait, what if we subtracted home runs from stolen bases and converted that into … nah, let’s just go back to on base percentage.

Over time, of course, real math won out and we all scurried to our spreadsheets only to find that actual mathematicians had taken over. Phrases like “normalized for park effects” became popular, as did the never-ending debate about replacement value.

The end result is that we *understand* more about baseball than we ever have before. And yet, note, the game has lost none of its appeal, none of its ability to surprise (2015 Kansas City Royals and Houston Astros, I’m looking at you). Our understanding exists in its best form in hindsight: we can tell you what contributed to the performance of a player or team, but the crystal ball is only slightly less cloudy than it was before when looking ahead.

So, now, to soccer.

There is a thirst to bring analytics to the game, and every other blog post on it talks about “the Moneyball of football.” But it’s a very different world. Revisiting the three points above:

First, outside of penalty kicks, which are skewed heavily towards a single outcome, soccer has almost no isolated moments of player versus player. Everything happens in flux: even the newly minted “take-on” exists only within the context of a team movement, where the position of the rest of the players, more often than not, is a contributing factor to the unfolding play (whether from a breakaway following action at the other end of the field or from a team intentionally trying to isolate a player on the wing). There are arguments against this: set pieces, some individual dribble attempts, things like that. But nothing that exists predictably and regularly, minute after minute, game after game.

Second, the most important thing in soccer occurs incredibly rarely. A single game may contain hundreds of touches of the ball, but only a single goal. Getting the ball into the back of the net seems to have some relationship to those hundreds of touches, but it’s not very clear what that might be, exactly. This is the core of the blurring of the game that occurs when you put on analytical goggles.
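To get a feel for how much blur rare goals introduce, here is a minimal sketch. It assumes each side's goals follow an independent Poisson distribution, which is a common modeling simplification and not something established in this piece. The scoring rates (1.6 and 1.0 goals per game) and the function names are invented for illustration.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k goals when the expected number is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def match_outcome_probs(lam_strong, lam_weak, max_goals=15):
    """Return (win, draw, loss) probabilities for the stronger side.

    ASSUMPTION: the two teams' goal counts are independent Poisson
    variables; scorelines above max_goals are ignored as negligible.
    """
    win = draw = loss = 0.0
    for a in range(max_goals + 1):
        for b in range(max_goals + 1):
            p = poisson_pmf(a, lam_strong) * poisson_pmf(b, lam_weak)
            if a > b:
                win += p
            elif a == b:
                draw += p
            else:
                loss += p
    return win, draw, loss

if __name__ == "__main__":
    # Hypothetical rates: the better team expects 1.6 goals, the other 1.0
    win, draw, loss = match_outcome_probs(1.6, 1.0)
    print("better team wins:", round(win, 2))
    print("draw:", round(draw, 2))
    print("better team loses:", round(loss, 2))
```

With rates that low, the clearly better side fails to win a large share of the time in this toy model. Any single result says very little about who played better, and the hundreds of touches behind it say even less.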

Third, and closely related, we sort of intuitively know that counting goals is insufficient. So we’ve rushed into a mode of counting anything we can think of, and most of it is utter rubbish. What are you likely to see as match statistics? Distance run. Possession, but without context (and with match announcers making hay out of a 53 to 47% edge, which is most likely pure statistical noise). Shots on target (don’t get me started: the wickedly knuckling ball that has the GK beat, but slides eighteen inches wide is “off target,” while the weak header that loops comfortably into his arms is “on target”). We’re counting just to count, and while in baseball there were enough options to make it all worthwhile, here … well … no wonder anoraks get a bad name.

It’s an exciting time: the attention being paid to the game (and, of course, the amount of money at stake) is sure to produce advances in our understanding similar to those baseball has seen over the past three to four decades. But we’re a long, long way off right now. And I would claim that the structure of soccer (the complex interactions, the importance of the elusive concept of space, the wide variety of tactical approaches to the game) conspires to make the question of analytics stunningly complicated.

To heck with complicated: it’s maddening. Possession is meaningless when it takes fewer than five seconds for the other side to score; most games are ninety minutes, of which the ball is in play fewer than sixty-five, of which the ball is in meaningful play fewer than twenty.

The problem is that most of the information we have might be illuminating, but it falls far short of helping us figure out what players are contributing to a team winning. And without that, it’s all conjecture and visual impression and subtle prejudices shining through.

There is statistical analysis that does a fantastic job of confirming the outliers. These pieces on Lionel Messi, for example, are stunning (especially the first one, go read it now if you haven’t, seriously). But if you want to figure out if Juan Mata is more or less valuable than David Silva to their clubs, it’s awfully difficult. And if you want to compare César Azpilicueta to Romelu Lukaku, you really have no way of even starting the conversation, not if you want it grounded in more than hazy opinions that get repeated often enough to be held as fact.

One more vital difference from baseball: the study of baseball was built on freely available information sources. Soccermetrics has grown at a time when information has already been commodified, and it may very well be that Opta has a huge wealth of data that would support an explosion in the analytics of the sport, but is only opening up its trove for the deep wallets of top-tier teams.

I suspect not, however. I suspect that instead, the game remains resistant to being easily solved, and requires a different set of approaches, something that factors in multiple chains of events with multiple points of inflection. This will, I think, contribute to it remaining elusive: simple ratios and simple counts are unlikely to get us there, which makes the insight into the data more difficult.

Most importantly, it also makes the dissemination and explanation of the data more challenging, as there are few in the soccerverse willing to “just accept” some new version of WinShares without peering under the hood.

Onwards into the fog!


3 Responses to Soccermetrics: An Opening Salvo

  1. Chris Gluck says:

    Steve Fenn put me on to this article… for what it is worth you may enjoy some of these articles as you begin to dig in… http://possessionwithpurpose.com/

  2. Thanks, Chris! I struggle with filtering the noise in possession statistics, trying to figure out how much of it is the “current” style of play and how much of it is *truly* purposeful. I think the distinction of attacking third is quite useful, but ultimately I don’t know how to distinguish possession-with-purpose from possession-because-we’re-aping-Barcelona … reading deeper on your site now, thanks again!

  3. Pingback: Soccermetrics: Let the Data … Squeak? | Us3. Online.
