Walk Like a Sabermetrician: Run Distribution and W%, 2017

I always start this post by looking at team records in blowout and non-blowout games. This year, after a Twitter discussion with Tom Tango, I’ve narrowed the definition of blowout to games in which the margin of victory is six runs or more (rather than five, the definition used by Baseball-Reference and that I had independently settled on). In 2017, the percentage of games decided by x runs and by >= x runs are:

If you draw the line at 5 runs, then 21.6% of games are classified as blowouts. At 6 runs, it drops to 15.3%. Tango asked his Twitter audience two different but related poll questions. The first was “what is the minimum margin that qualifies a game as a blowout?”, for which the plurality was six (39%, with 5, 7, and 8 the other options). The second was “what percentage of games do you consider to be blowouts?”, for which the plurality was 8-10% (43%, with the other choices being 4%-7%, 11%-15%, and 17-21%). Using the second criterion, one would have to set the bar at a margin of seven, at least in 2017.

As Tango pointed out, it is of interest that asking a similar question in different ways can produce different results. Of course this is well-known to anyone with a passing interest in public opinion polling. But here I want to focus more on some of the pros and cons of having a fixed standard for a blowout or one that varies depending on the actual empirical results from a given season.

A variable standard would recognize that as the run environment changes, the distribution of victory margin will surely change (independent of any concurrent changes in the distribution of team strength), expanding when runs are easier to come by. Of course, this point also means that what is a blowout in Coors Field may not be a blowout in Petco Park. The real determining factor of whether a game is a blowout is whether the probability that the trailing team can make a comeback (of course, picking one standard and applying to all games ignores the flow of a game; if you want to make a win probability-based definition, go for it).

On the other hand, a fixed standard allows the percentage of blowouts to vary over time, and maybe it should. If the majority of games were 1-0, it would sure feel like there were vary few blowouts even if the probability of the trailing team coming back was very low. Ideally, I would propose a mixed standard, in which the margin necessary for a blowout would not be a fixed % of games but rather somehow tied to the average runs scored/game. However, for the purpose of this post, Tango’s audience answering the simpler question is sufficient for my purposes. I never had any strong rationale for using five, and it does seem like 22% of games as blowouts is excessive.

Given the criterion that a blowout is a game in which the margin of victory was six or more, here are team records in non-blowouts:

Records in blowouts:

The difference between blowout and non-blowout records (B - N), and the percentage of games for each team that fall into those categories:

Keeping in mind that I changed definitions this year (and in so doing increased random variation if for no reason other than the smaller percentage of games in the blowout bucket), it is an oddity to see two of the very best teams in the game (HOU and WAS) with worse records in blowouts. Still, the general pattern is for strong teams to be even better in blowouts, per usual. San Diego stands out as the most extreme team, with an outlier poor record in blowouts offsetting an above-.500 record in non-blowouts, although given that they play in park with the lowest PF, their home/road discrepancy between blowout frequency should theoretically be higher than most teams.

A more interesting way to consider game-level results is to look at how teams perform when scoring or allowing a given number of runs. For the majors as a whole, here are the counts of games in which teams scored X runs:

The “marg” column shows the marginal W% for each additional run scored. In 2017, three was the mode of runs scored, while the second run resulted in the largest marginal increase in W%.

The major league average was 4.65 runs/game; at that level, here is the estimated probability of scoring x runs using the Enby Distribution (stopping at fifteen):

In graph form (again stopping at fifteen):

This is a pretty typical visual for the Enby fit to the major league average. It’s not perfect, but it is a reasonable model.

In previous years I’ve used this observed relationship to calculate metrics of team offense and defense based on the percentage of games in which they scored or allowed x runs. But I’ve always wanted to switch to using theoretical values based on the Enby Distribution, for a number of reasons:

1. The empirical distribution is subject to sample size fluctuations. In 2016, all 58 times that a team scored twelve runs in a game, they won; meanwhile, teams that scored thirteen runs were 46-1. Does that mean that scoring 12 runs is preferable to scoring 13 runs? Of course not--it's a quirk in the data. Additionally, the marginal values don’t necessary make sense even when W% increases from one runs scored level to another.

2. Using the empirical distribution forces one to use integer values for runs scored per game. Obviously the number of runs a team scores in a game is restricted to integer values, but not allowing theoretical fractional runs makes it very difficult to apply any sort of park adjustment to the team frequency of runs scored.

3. Related to #2 (really its root cause, although the park issue is important enough from the standpoint of using the results to evaluate teams that I wanted to single it out), when using the empirical data there is always a tradeoff that must be made between increasing the sample size and losing context. One could use multiple years of data to generate a smoother curve of marginal win probabilities, but in doing so one would lose centering at the season’s actual run scoring rate. On the other hand, one could split the data into AL and NL and more closely match context, but you would lose sample size and introduce more quirks into the data.

So this year I am able to use the Enby Distribution. I have Enby Distribution parameters at each interval of .05 runs/game. Since it takes a fair amount of manual work to calculate the Enby parameters, I have not done so at each .01 runs/game, and for this purpose it shouldn’t create too much distortion (more on this later). The first step is to take the major league average R/G (4.65) and park-adjust it. I could have park-adjusted home and road separately, and in theory you should be as granular as practical, but the data on teams scoring x runs or more is not readily available broken out between home and road. So each team’s standard PF which assumes a 50/50 split of home and road games is used. I then rounded this value to the nearest .05 and calculated the probability of scoring x runs using the Enby Distribution (with c = .852 since this exercise involves interactions between two teams).

For example, there were two teams that had PFs that produced a park-adjusted expected average of 4.90 R/G (ARI and TEX). In other words, an average offense playing half their games in Arizona's environment should have scored 4.90 runs/game; an average defense doing the same should have allowed 4.90 runs/game. The Enby distribution probabilities of scoring x runs for a team averaging 4.90 runs/game are:

For each x, it’s simple to estimate the probability of winning. If this team scores three runs in a particular game, then they will win if they allow 0 (4.3%), 1 (8.6%), or 2 runs (12.1%). As you can see, this construct assumes that their defense is league-average. If they allow three, then the game will go to extra innings, in which case they have a 50% chance of winning (this exercise doesn’t assume anything about inherent team quality), so in another 13.6% of games they win 50%. Thus, if this the Diamondbacks score three runs, they should win 31.8% of those games. If they allow three runs, it’s just the complement; they should win 68.2% of those games.

Using these probabilities and each team’s actual frequency of scoring x runs in 2017, I calculate what I call Game Offensive W% (gOW%) and Game Defensive W% (gDW%). It is analogous to James’ original construct of OW% except looking at the empirical distribution of runs scored rather than the average runs scored per game. (To avoid any confusion, James in 1986 also proposed constructing an OW% in the manner in which I calculate gOW%, which is where I got the idea).

As such, it is natural to compare the game versions of OW% and DW%, which consider a team’s run distribution, to their OW% and DW% figured using Pythagenpat in a traditional manner. Since I’m now using park-adjusted gOW%/gDW%, I have park-adjusted the standard versions as well. As a sample calculation, Detroit averaged 4.54 R/G and had a 1.02 PF, so their adjusted R/G is 4.45 (4.54/1.02). OW% The major league average was 4.65 R/G, and since they are assumed to have average defense we use that as their runs allowed. The Pythagenpat exponent is (4.45 + 4.65)^.29 = 1.90, and so their OW% is 4.45^1.90/(4.45^1.90 + 4.65^1.90) = .479, meaning that if the Tigers had average defense they would be estimated to win 47.9% of their games.

In previous year’s posts, the major league average gOW% and gDW% worked out to .500 by definition. Since this year I’m 1) using a theoretical run distribution from Enby 2) park-adjusting and 3) rounding team’s park-adjusted average runs to the nearest .05, it doesn’t work out perfectly. I did not fudge anything to correct for the league-level variation from .500, and the difference is small, but as a technical note do consider that the league average gOW% is .497 and the gDW% is .503.

For most teams, gOW% and OW% are very similar. Teams whose gOW% is higher than OW% distributed their runs more efficiently (at least to the extent that the methodology captures reality); the reverse is true for teams with gOW% lower than OW%. The teams that had differences of +/- 2 wins between the two metrics were (all of these are the g-type less the regular estimate, with the teams in descending order of absolute value of the difference):

Positive: SD, TOR
Negative: CHN, WAS, HOU, NYA, CLE

The Cubs had a standard OW% of .542, but a gOW% of .516, a difference of 4.3 wins which is the largest such discrepancy for any team offense/defense in the majors this year. I always like to pick out this team and present a graph of their runs scored frequencies to offer a visual explanation of what is going on, which is that they distributed their runs less efficiently for the purpose of winning games than would have been expected. The Cubs average 5.07 R/G, which I’ll round to 5.05 to be able to use an estimated distribution I have readily available from Enby (using the parameters for c = .767 in this case since we are just looking at one team in isolation):

The Cubs were shutout or held to one run in eight more games than one would expect for a team that average 5.05 R/G; of course, these are games that you almost certainly will not win. They scored three, four, or six runs significantly less than would be expected; while three and four are runs levels at which in 2017 you would expect to lose more often then you win, even scoring three runs makes a game winnable (.345 theoretical W% for a team in the Cubs’ run environment). The Cubs had 4.5 fewer games scoring between 9-12 runs than expected, which should be good from an efficiency perspective, since even at eight runs they should have had a .857 W%. But they more than offset that by scoring 13+ runs in a whopping 6.8% of their games, compared to an expectation of 2.0%--7.8 games more than expected where they gratuitously piled on runs. Chicago scored 13+ runs in 11 games, with Houston and Washington next with nine, and it’s no coincidence that they were also very inefficient offensively.

The preceding paragraph is an attempt to explain what happened; despite the choice of non-neutral wording, I’m not passing judgment. The question of whether run distribution is strongly predictive compared to average runs has not been studied in sufficient detail (by me at least), but I tend to think that the average is a reasonable indicator of quality going forward. Even if I’m wrong, it’s not “gratuitous” to continue to score runs after an arbitrary threshold with a higher probability of winning has been cleared. In some cases it may even be necessary, as the Cubs did have three games in which they allowed 13+ runs, although they weren’t the same games. As we saw earlier, major league teams were 111-0 when scoring 13+ runs, and 294-17 when scoring 10-12.

Teams with differences of +/- 2 wins between gDW% and standard DW%:

Positive: SD, CIN, NYN, MIN, TOR, DET
Negative: CLE, NYA

San Diego and Toronto had positive differences on both sides of the ball; the Yankees and Cleveland had negative difference for both. Thus it is no surprise that those teams show up on the list comparing gEW% to EW% (standard Pythagenpat). gEW% combines gOW% and gDW% indirectly by converting both to equivalent runs/game using Pythagenpat (see this post for the methodology):

Positive: SD, TOR, NYN, CIN, OAK
Negative: WAS, CHN, CLE, NYA, ARI, HOU

The Padres EW% was .362, but based on the manner in which they actually distributed their runs and runs allowed per game, one would have expected a .405 W%, a difference of 6.9 wins which is an enormous difference for these two approaches. In reality, they had a .438 W%, so Pythagenpat’s error was 12.3 wins which is enormous in its own right.

gEW% is usually (but not always!) a more accurate predictor of actual W% than EW%, which it should be since it has the benefit of additional information. However, gEW% assumes that runs scored and allowed are independent of each other on the game-level. Even if that were true theoretically (and given the existence of park factors alone it isn’t), gEW% would still be incapable of fully explaining discrepancies between actual records and Pythagenpat.

The various measures discussed are provided below for each team.

Finally, here are the Crude Team Ratings based on gEW% since I hadn’t yet calculated gEW% when that post was published: