{My opening salvo drew what is–for me–a much, much wider audience than almost anything else I’ve written. So that’s great, but it also ups the pressure for this second post. We’ll see if I’m up to it.
I am trying to keep these relatively brief (< 1K words), as my goal is to build an ongoing context/set of hypotheses to work from over time. That may bite me, as it prevents some deeper data dives, but I’ll incorporate those separately as I can.
Finally, a tip of the hat to fantastic designer Juhan Sonin (@JSonin) for the title and rallying cry.}
I want to work the baseball comparison a little more.
There was a moment in the late 90s when I had four different websites running off various versions of databases that contained complete historical information for the entirety of MLB history. (This was before baseball-reference took over the world–I mean that in a good way.) One of these had been modified and extended to include minor league players, another added hundreds of players from the Negro Leagues, and another tracked fantasy performance in relation to actual performance. And this was pretty easy to do. The core data and its structure were easily available, and importing it into SQL (it was the 90s), optimizing it, writing stored procedures, etc., was all relatively straightforward given a moderate amount of technical skill.
This was the baseline, and it was pretty rich: yearly performance totals for every player and every team, ever. Sure, the data flaked out the further back or further afield you went, but it was a solid platform for building and testing notions and working hypotheses.
Note the focus: I am most interested in player performance. There is some great work being done on team performance, and, yay, that’s fantastic. But I am most fascinated by how the individuals combine to create that performance. In sabermetric terms, I am more interested in analyzing Win Shares than Wins.
This is entirely missing from the soccermetrics ecology.
It gets back to one of the points from my Opening Salvo: we aren’t counting useful things.
And I don’t see a solve for this: I don’t think there are enough useful things to count on a seasonal basis. I enhanced that sentence because it’s damn important: the things that are emerging that seem worth counting are almost exclusively context- and situation-dependent. The data we have barely squeaks at the player level.
So, let’s dive into that for a moment, again working the baseball parallel.
I really don’t think it is possible to overestimate the importance of Retrosheet (and, by extension, the tireless efforts of Dave Smith) in the growth of sabermetrics. Retrosheet isn’t sexy, it isn’t fancy, there’s no infinite scroll or 3-column theme, and there’s very little to invite the casual user into its labyrinthine depths. There are also no ads.
What it is: a thoroughly vetted, crowd-sourced, amateur (in the absolutely best sense of the word) repository.
It was among the earliest successful efforts at statistical crowdsourcing. Retrosheet would help a researcher gain access to box scores (often grainy photocopies of newspaper accounts or scoresheets provided by people who attended the games); the researcher would then convert that information into a digital, structured form and send it back. Over time, this meant that a game-by-game and, in many cases, play-by-play database emerged that was publicly accessible and queryable. Suddenly, if you wanted to know what Honus Wagner did on July 17, 1910, you could. That’s pure esoterica. But the data generated significant research as well.
There was a source of trusted information–and if you couldn’t find it, Tom Ruane could and usually would.
So, we have two things that were simultaneously available/emerging: a standard for game-by-game reporting and easy access to important season-over-season historical information.
This allowed a much wider range of innovation, and while there were many, many missteps, it was critical to a true understanding of the game. Not only did Bill James famously “break the wand,” there was also nothing that he did that other people could not have done. I mean, short of being less dedicated, less innovative, less determined, and less entertaining than James.
The parallel here is to the various game-state data systems that exist in proprietary forms right now: things that track field position, activity, time, stuff like that (that is, with the score 2-1, Juan Mata attempted a cross from x,y coordinate on the right wing and a foul was called on Marouane Fellaini).
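To make that parenthetical concrete, here is a minimal sketch of what one such game-state record might look like as structured data. The field names, the 0-100 coordinate scale, and the minute and x,y values are all my own placeholders for illustration; this is not Opta's (or anyone's) actual schema.

```python
from dataclasses import dataclass, asdict
from typing import Optional
import json


@dataclass(frozen=True)
class MatchEvent:
    """One on-field action plus the game state around it (hypothetical schema)."""
    minute: int
    home_score: int
    away_score: int
    player: str
    action: str
    x: float  # position along the pitch length, assumed 0-100 scale
    y: float  # position across the pitch width, assumed 0-100 scale
    outcome: str
    other_player: Optional[str] = None  # second player involved, if any

    def score_state(self) -> str:
        """Return the score at the moment of the event, e.g. '2-1'."""
        return f"{self.home_score}-{self.away_score}"


# The example from the text; minute and coordinates are invented placeholders.
event = MatchEvent(
    minute=63,
    home_score=2,
    away_score=1,
    player="Juan Mata",
    action="cross",
    x=88.0,
    y=12.0,
    outcome="foul",
    other_player="Marouane Fellaini",
)

# Serialize to JSON, the kind of plain structured form anyone could load.
record = json.dumps(asdict(event))
```

The point of a flat record like this is that the context (score, time, location) travels with the action, which is what a context- and situation-dependent analysis would need.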
Not only is there nothing publicly available like this for soccer, nothing seems to be on the horizon.
The reason, essentially, is profit. And, look, I respect the profit drive. I think, yes, Opta (and all the rest) should profit from their work. See here for a decent overview of the issue.
But it’s not an either/or situation (it rarely is). The possibilities are legion, but the obvious solve is for those organizations to release partial data sets (entire past seasons, all data for a single team, all data for a dozen players, whatever–you can slice this however you want). But release it in an easily downloadable, structured form.
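As a rough illustration of how cheap that slicing is once the data is structured, here is a small sketch that filters event rows by season, team, or a set of players and writes the result out as CSV. The column names and the sample rows ("Team A", "Player 1") are invented placeholders, not real data or any provider's format.

```python
import csv
import io
from typing import Dict, Iterable, List, Optional, Set

Row = Dict[str, str]


def slice_events(
    rows: Iterable[Row],
    season: Optional[str] = None,
    team: Optional[str] = None,
    players: Optional[Set[str]] = None,
) -> List[Row]:
    """Keep only rows matching every filter that was supplied."""
    selected = []
    for row in rows:
        if season is not None and row["season"] != season:
            continue
        if team is not None and row["team"] != team:
            continue
        if players is not None and row["player"] not in players:
            continue
        selected.append(row)
    return selected


def to_csv(rows: List[Row]) -> str:
    """Render rows as CSV text with a header line; empty input gives ''."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder rows, purely illustrative.
SAMPLE = [
    {"season": "2012-13", "team": "Team A", "player": "Player 1", "action": "cross"},
    {"season": "2012-13", "team": "Team B", "player": "Player 2", "action": "shot"},
    {"season": "2013-14", "team": "Team A", "player": "Player 1", "action": "pass"},
]

one_season = slice_events(SAMPLE, season="2012-13")  # an entire past season
one_team = slice_events(SAMPLE, team="Team A")       # all data for one team
release = to_csv(one_team)                           # downloadable, structured
```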
Yes, the data will be imperfect; yes, the older data is likely not nearly as rich as current information. But you know what?
It won’t matter.
That information will yield both gold and dross. And Opta (or whoever) would retain their competitive edge by incorporating the gold into their current and future products.
Somehow, some way, we have to let the data scream, loudly and publicly.