Monday, February 01, 2016

Run Distribution and W%, 2015

Every year I state that by the time this post rolls around next year, I hope to have a fully functional Enby distribution to allow the metrics herein to be more flexible (e.g. not based solely on empirical data, able to handle park effects, etc.) And every year during the year I wind up deciding that writing articles about other topics or trying to finish my professional education or watching some terrible TV show like Haven is a bigger priority than explaining how Enby applies.

Enby is a zero-modified negative binomial model to calculate the probability that a team will score X runs in a game. It is without question my favorite of my own body of sabermetric work and yet for some reason the hardest for me to get motivated to write about. Were the problem that I needed to finish working on the model itself, it would be a huge priority (almost a compulsion) for me. But I did that a long time ago, and now just need to make it presentable. I’d say maybe next year but history suggests I’d be lying to you.

Anyway, there are some elements of Enby in this post, as I’ve written enough about the model to feel comfortable using bits and pieces. But I’d like to overhaul the calculation of gOW% and gDW% that are used at the end based on Enby, and I’m not ready to do that just yet given the deficiency of the material I’ve published on Enby.

Self-indulgence, aggrandizement, and deprecation aside, I need to caveat that this post in no way accounts for park effects. But that won’t come in to play as I first look at team record in blowouts and non-blowouts, with a blowout defined as 5+ runs. Obviously some five run games are not truly blowouts, and some are; one could probably use WPA to make a better definition of blowout based on some sort of average win probability, or the win probability at a given moment or moments in the game. I should also note that Baseball-Reference uses this same definition of blowout. I am not sure when they started publishing it; they may well have pre-dated by usage of five runs as the delineator. However, I did not adopt that as my standard because of Baseball-Reference, I adopted it because it made the most sense to me being unaware of any B-R standard.

73.9% of major league games in 2015 were non-blowouts (of course 26.1% were). The leading records in non-blowouts:

The three NL Central powerhouses top the list, with all playing a lot of non-blowout games as you’ll see in a moment (the Cubs had the second-highest percentage of non-blowouts, the Pirates fourth, and Cardinals seventh) and playing very well in those games. Of the three only Pittsburgh had a better record in blowouts, which is unusual as good teams tend to do better in blowouts:

The Blue Jays were an odd case as well, actually sub-.500 (56-57) in non-blowouts but dominant in them. The only other playoff team to be sub-.500 in either was Texas in blowouts (22-25).

This chart is sorted by the difference between blowout and non-blowout W% and includes the percentage of blowouts for each team:

A more interesting way to consider game-level results is to look at how teams perform when scoring or allowing a given number of runs. For the majors as a whole, here are the counts of games in which teams scored X runs:

The “marg” column shows the marginal W% for each additional run scored. In 2015, the fourth run was both the run with the greatest marginal impact on the chance of winning and the level of scoring for which a team was more likely to win than lose.

I use these figures to calculate a measure I call game Offensive W% (or Defensive W% as the case may be), which was suggested by Bill James in an old Abstract. It is a crude way to use each team’s actual runs per game distribution to estimate what their W% should have been by using the overall empirical W% by runs scored for the majors in the particular season.

The theoretical distribution from Enby discussed earlier would be much preferable to the empirical distribution for this exercise, but I’ve defaulted to the 2015 empirical data. Some of the drawbacks of this approach are:

1. The empirical distribution is subject to sample size fluctuations. In 2015, teams that scored 11 runs won 98.5% of the time while teams that scored 10 runs won 98.0% of the time. Does that mean that scoring 12 runs is preferable to scoring 11 runs? Of course not--it's a quirk in the data. Additionally, the marginal values don’t necessary make sense even when W% increases from one runs scored level to another (In figuring the gEW% family of measures below, I lumped games with 11 and 12 runs scored/allowed into one bucket, which smoothes any illogical jumps in the win function, but leaves the inconsistent marginal values unaddressed and fails to make any differentiation between scoring in that range. The values actually used are displayed in the “use” column, and the “invuse” column is the complements of these figures--i.e. those used to credit wins to the defense. I've used 1.0 for 13+ runs, which is a horrible idea theoretically. In 2015, teams were 81-0 when scoring 13 or more runs).

2. Using the empirical distribution forces one to use integer values for runs scored per game. Obviously the number of runs a team scores in a game is restricted to integer values, but not allowing theoretical fractional runs makes it very difficult to apply any sort of park adjustment to the team frequency of runs scored.

3. Related to #2 (really its root cause, although the park issue is important enough from the standpoint of using the results to evaluate teams that I wanted to single it out), when using the empirical data there is always a tradeoff that must be made between increasing the sample size and losing context. One could use multiple years of data to generate a smoother curve of marginal win probabilities, but in doing so one would lose centering at the season’s actual run scoring rate. On the other hand, one could split the data into AL and NL and more closely match context, but you would lose sample size and introduce more quirks into the data.

I keep promising that I will use Enby to replace the empirical approach, but for now I will use Enby for a couple graphs but nothing more.

First, a comparison of the actual distribution of runs per game in the majors to that predicted by the Enby distribution for the 2015 major league average of 4.250 runs per game (Enby distribution parameters are B = 1.0798, r = 3.966, z = .0619):

Enby didn’t predict enough shutouts or two run games, and too many three run games. There’s also a blip in the empirical data at eight runs scored (5.29% compared to 4.55% predicted by Enby). It doesn’t show up on the chart, but Enby predicted .35% of games with 16+ runs scored; the actual frequency was .31%.

I will not go into the full details of how gOW%, gDW%, and gEW% (which combines both into one measure of team quality) are calculated in this post, but full details were provided here and the paragraph below gives a quick explanation. The “use” column here is the coefficient applied to each game to calculate gOW% while the “invuse” is the coefficient used for gDW%. For comparison, I have looked at OW%, DW%, and EW% (Pythagenpat record) for each team; none of these have been adjusted for park to maintain consistency with the g-family of measures which are not park-adjusted.

A team’s gOW% is the sumproduct of their frequency of scoring x runs, where x runs from 0 to 22, and the empirical W% of teams in 2015 when they scored x runs. For example, Atlanta was shutout 17 times; they would not be expected to win any of those games (nor would they, we can be certain). They scored one run 20 times; an average team would have a .082 W% when scoring one run, so they could have been expected to win 1.64 of the twenty games given average defense. They scored two runs 32 times; an average team would have a .283 W% when scoring two, so they could have been expected to win 9.06 of those games given average defense. Sum up the estimated wins for each value of x and divide by the team’s total number of games and you have gOW%.

It is thus an estimate of what W% a team with the given team’s empirical distribution of runs scored and a league average defense would have. It is thus analogous to James’ original construct of OW% except looking at the empirical distribution of runs scored rather than the average runs scored per game. (To avoid any confusion, James in 1986 also proposed constructing an OW% in the manner in which I calculate gOW%).

For most teams, gOW% and OW% are very similar. Teams whose gOW% is higher than OW% distributed their runs more efficiently (at least to the extent that the methodology captures reality); the reverse is true for teams with gOW% lower than OW%. The teams that had differences of +/- 2 wins between the two metrics were (all of these are the g-type less the regular estimate):

Positive: TB, SEA, ATL, PHI, CHA, LA, STL
Negative: NYA, TOR, HOU

Teams with differences of +/- 2 wins between gDW% and standard DW%:

Positive: TEX, ATL, BOS, SEA
Negative: PIT, HOU, SF, STL, MIA

Pittsburgh’s defense allowed 3.679 runs per game, which one would expect to result in a .565 W% with average offense. But based on their runs allowed distribution, one would only expect a .540 W% paired with an average offense. That difference of 4.1 wins was the greatest absolute difference on offense and defense for any major league team, so it may be instructive to look at a graph of their runs allowed distribution and what Enby would predict for such a team (B = 1.0163, r = 3.655, z = .0859):

Pittsburgh had many fewer one-run games than one would expect (actual 8.0%, Enby estimate 14.1%), but allowed two to five runs more than would be expected and allowed eight or more runs 8.0% of the time versus an expectation of 9.4%.

Teams with differences of +/- 2 wins between gEW% and standard EW%:

Positive: SEA, ATL, CHA, PHI, OAK, TB, COL
Negative: HOU, TOR, PIT, NYA, WAS, NYN, SF

It’s no surprise that SEA, ATL, and HOU appear prominently as they were the only teams to have both their offense and defense appear on the positive and negative lists in the same direction. Even with bad clustering of both runs scored and runs allowed, Houston was a good team, but their gEW% of .539 tracks their actual W% of .531 better than their EW% of .576. In 2015, the RMSE of gEW% as a predictor of W% was about 4.4 wins, while EW% had a RMSE of 4.7 wins (gEW% usually, but not always over a thirty team sample, performs better as it should given the advantage of knowing the actual distribution of runs scored and allowed, even treating them independently.)

One might think that the blessed Royals, given their well-known ability to hit at the right time and play the game the right way and so many other attributes that make them so very dear to media members everywhere, would have clustered their runs efficiently. Especially their offense. But they really didn’t. KC’s gOW% was .524; their standard OW% was .524. Their run distribution, converted to equivalent wins with an average defense, was pretty much exactly what you would expect for a team that averaged 4.47 R/G. Their defense was slightly less efficient, with a .529 gDW% and .533 standard DW%. Where Kansas City made hay was the difference between their gEW% (and standard EW%) and their actual W%, which would necessarily result from a more efficient pairing of runs scored and runs allowed. It is quite tempting to credit the bullpen for this, as in theory bullpens can be strategically deployed given the game circumstances and thus increase the covariance between runs scored and allowed. But any such deviation for the Royals falls under the standard deviation from Pythagorean expectation and not anything special in the way the offense or defense alone distributed their runs.

Below is a full chart with the various actual and estimated W%s:


  1. Does "the deficiency of the material I’ve published on Enby" refer to and update to the calculation estimating variance?

    Based on the Enby series published, it looks like variance would be 8.5069, B=1.0016, and r=4.2933 (given 4.25 R/G).

    Since here you have B=1.0798 here, r=3.9660, and you mentioned variance is "an area that might well be improved upon" (in the series review), does that mean the variance estimate was in deed improved upon and that is just part of the Enby work that is yet-to-be written about?

  2. Yes, exactly. Alan Jordan was able to estimate variance using the Tango Distribution and I have incorporated that into Enby. The cool thing is that the estimate is of the form variance = x*RPG^2 + y*RPG which is what I came up with, but the coefficients are different. The bad thing is that it isn't that much different than my original estimate, so there isn't a huge difference in accuracy.


I reserve the right to reject any comment for any reason.