Walk Like a Sabermetrician: Run Distribution and W%, 2013

A couple of caveats apply to everything that follows in this post. The first is that there are no park adjustments anywhere. There's obviously a difference between scoring 5 runs at Petco and scoring 5 runs at Coors, but if you're using discrete data there's not much that can be done about it unless you want to use a different distribution for every possible context. Similarly, it's necessary to acknowledge that games do not always consist of nine innings; again, it's tough to do anything about this while maintaining your sanity.

All of the conversions of runs to wins are based only on 2013 data. Ideally, I would use an appropriate distribution for runs per game based on average R/G, but I've taken the lazy way out and used the empirical data for 2013 only. (I have a methodology I could use to estimate win probabilities at each level of scoring that takes context into account, but I haven't finished the full write-up it needs on this blog, and I'm not comfortable using it without that explanation.)

The first breakout is record in blowouts versus non-blowouts. I define a blowout as a margin of five or more runs. This is not really a satisfactory definition of a blowout, as many five-run games are quite competitive--"blowout" is just a convenient label to use, and expresses the point succinctly. I use these two categories with wide ranges rather than more narrow groupings like one-run games because the frequency and results of one-run games are highly biased by the home field advantage. Drawing the focus back a little allows us to identify close games and not so close games with a margin built in to allow a greater chance of capturing the true nature of the game in question rather than a disguised situational effect.
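As a rough sketch of the split described above, the snippet below sorts one team's games into blowouts and non-blowouts using the five-run margin. The `Game` tuple, the function names, and the sample scores are all made up for illustration; they are not taken from the 2013 data.

```python
from collections import namedtuple

# Hypothetical record type: one team's runs scored and allowed in a game.
Game = namedtuple("Game", ["runs_scored", "runs_allowed"])

BLOWOUT_MARGIN = 5  # a margin of five or more runs, per the definition above


def split_record(games, margin=BLOWOUT_MARGIN):
    """Return {'blowout': (W, L), 'non_blowout': (W, L)} for one team."""
    record = {"blowout": [0, 0], "non_blowout": [0, 0]}
    for g in games:
        diff = g.runs_scored - g.runs_allowed
        if diff == 0:
            continue  # assumption: tied or suspended games are simply skipped
        key = "blowout" if abs(diff) >= margin else "non_blowout"
        record[key][0 if diff > 0 else 1] += 1
    return {k: tuple(v) for k, v in record.items()}


def w_pct(wins, losses):
    """Winning percentage; NaN if the team has no games in the category."""
    return wins / (wins + losses) if wins + losses else float("nan")


# Made-up scores, purely for illustration.
sample = [Game(7, 1), Game(3, 4), Game(2, 8), Game(5, 4), Game(10, 5)]
result = split_record(sample)
blowout_share = sum(result["blowout"]) / len(sample)
```

The difference `w_pct(*result["blowout"]) - w_pct(*result["non_blowout"])` is the kind of figure the team-level sort is built on.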

In 2013, 74.7% of major league games were non-blowouts while the remaining 25.3% were blowouts. Team record in non-blowouts:

And in blowouts:

Teams sorted by difference between blowout and non-blowout W%, as well as the percentage of blowouts for each team:

Baltimore is one of the teams that interest me here; their unbelievable one-run record in 2012 was well-documented, and so it shouldn’t surprise that the Orioles ranked second in the majors in 2012 in non-blowout W% but were just over .500 in blowouts (23-21). In 2013, Baltimore just quit playing in blowouts, with only 15% of their games decided by five or more runs (only the White Sox at 17% joined them under 20% blowouts), but when they did play one they went 14-11. Boston had the largest W% differential between blowouts and non-blowouts and were also the best team in the majors per most result-based perspectives.

A more interesting way to consider game-level results is to look at how teams perform when scoring or allowing a given number of runs. For the majors as a whole, here are the counts of games in which teams scored X runs:

The “marg” column shows the marginal W% for each additional run scored. In 2013, the second run was the marginally most valuable while the fourth was the cutoff point between winning and losing.
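For anyone who wants to rebuild this kind of table, here is a minimal sketch of the W% and marginal W% by runs scored. It assumes the input is a list of (runs scored, won) pairs with one entry per team-game; the function names and the sample data are hypothetical.

```python
from collections import defaultdict


def w_pct_by_runs(team_games):
    """team_games: iterable of (runs_scored, won) pairs, one per team-game."""
    tally = defaultdict(lambda: [0, 0])  # runs -> [wins, games]
    for runs, won in team_games:
        tally[runs][0] += int(won)
        tally[runs][1] += 1
    return {r: w / n for r, (w, n) in sorted(tally.items())}


def marginal(w_by_runs):
    """W% at X runs minus W% at X-1 runs, wherever both levels exist."""
    return {r: w_by_runs[r] - w_by_runs[r - 1]
            for r in w_by_runs if r - 1 in w_by_runs}


# Made-up team-games, not 2013 results.
games = [(0, False), (0, False),
         (1, False), (1, False), (1, False), (1, True),
         (2, True), (2, False), (2, True), (2, False)]
w_by_runs = w_pct_by_runs(games)
marg = marginal(w_by_runs)
```

With a full season of data, the run level with the largest `marg` value is the marginally most valuable run, and the first level at which W% crosses .500 is the cutoff between winning and losing.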

I use these figures to calculate a measure I call game Offensive W% (or Defensive W% as the case may be), which was suggested by Bill James in an old Abstract. It is a crude way to use each team’s actual runs per game distribution to estimate what their W% should have been by using the overall empirical W% by runs scored for the majors in the particular season.

A theoretical distribution would be much preferable to the empirical distribution for this exercise, but as I mentioned earlier I haven’t yet gotten around to writing up the requisite methodological explanation, so I’ve defaulted to the 2013 empirical data. Some of the drawbacks of this approach are:

1. The empirical distribution is subject to sample size fluctuations. In 2013, teams that scored 7 runs won 85.8% of the time while teams that scored 8 runs won 83.2% of the time. Does that mean that scoring 7 runs is preferable to scoring 8 runs? Of course not--it's a quirk in the data. Additionally, the marginal values don’t necessarily make sense even when W% increases from one runs-scored level to the next. (In figuring the gEW% family of measures below, I lumped all games with 7 and 8 runs scored/allowed into one bucket, which smooths any illogical jumps in the win function but leaves the inconsistent marginal values unaddressed and fails to differentiate between scoring 7 and 8. The values actually used are displayed in the “use” column, and the “invuse” column holds the complements of these figures--i.e. those used to credit wins to the defense. I've used 1.0 for 12+ runs, which is a horrible idea theoretically. In 2013, teams were 102-0 when scoring 12 or more runs.)

2. Using the empirical distribution forces one to use integer values for runs scored per game. Obviously the number of runs a team scores in a game is restricted to integer values, but not allowing theoretical fractional runs makes it very difficult to apply any sort of park adjustment to the team frequency of runs scored.

3. Related to #2 (really its root cause, although the park issue is important enough from the standpoint of using the results to evaluate teams that I wanted to single it out), when using the empirical data there is always a tradeoff that must be made between increasing the sample size and losing context. One could use multiple years of data to generate a smoother curve of marginal win probabilities, but in doing so one would lose centering at the season’s actual run scoring rate. On the other hand, one could split the data into AL and NL and more closely match context, but one would lose sample size and introduce more quirks into the data.

I will use my theoretical distribution (Enby, which you can read about here) for a few charts in this post. The first is a comparison of the frequency of scoring X runs in the majors to what would be expected given the overall major league average of 4.166 R/G (Enby distribution parameters are r = 3.922, B = 1.07, z = .0649):

Enby generally does a decent job of estimating the actual scoring distribution, and while I am certainly not an unbiased observer, I think it does so here as well.
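This post does not spell out the Enby formula, so the sketch below rests on an assumption: that Enby is a zero-modified negative binomial, with r and B as the negative binomial parameters and z as the probability of a shutout. Under that assumption, the parameters quoted above reproduce the league average of roughly 4.166 R/G, which is at least a consistency check. The function name `enby_pmf` is hypothetical.

```python
import math


def enby_pmf(k, r, B, z):
    """P(k runs) under an ASSUMED zero-modified negative binomial form.

    The unmodified negative binomial is
        p(j) = Gamma(r + j) / (j! * Gamma(r)) * B**j / (1 + B)**(r + j)
    The zero-modified version sets P(0) = z and rescales the rest so the
    probabilities still sum to one.
    """
    def nb(j):
        return math.exp(math.lgamma(r + j) - math.lgamma(j + 1)
                        - math.lgamma(r)
                        + j * math.log(B) - (r + j) * math.log(1 + B))

    if k == 0:
        return z
    return (1 - z) * nb(k) / (1 - nb(0))


# Major league parameters quoted in the post.
probs = [enby_pmf(k, 3.922, 1.07, 0.0649) for k in range(61)]
total = sum(probs)                               # should be very close to 1
mean = sum(k * p for k, p in enumerate(probs))   # should be close to 4.166
```

Plotting `probs` against the empirical frequency of each run total gives the kind of comparison the chart shows.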

I will not go into the full details of how gOW%, gDW%, and gEW% (which combines both into one measure of team quality) are calculated in this post, but full details were provided here. The “use” column here is the coefficient applied to each game to calculate gOW% while the “invuse” is the coefficient used for gDW%. For comparison, I have looked at OW%, DW%, and EW% (Pythagenpat record) for each team; none of these have been adjusted for park to maintain consistency with the g-family of measures which are not park-adjusted.
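As a rough illustration only (the full method is in the linked post), the sketch below treats gOW% as the average "use" coefficient over a team's games and gDW% as the average "invuse" coefficient. That averaging step is my reading of the description above, not a confirmed formula, and the coefficients in `USE` are placeholders rather than the 2013 values. The sketch does follow the bucketing described earlier: 7 and 8 runs share one value, and 12+ runs is set to 1.0.

```python
# PLACEHOLDER coefficients for illustration; NOT the actual 2013 "use" column.
USE = {0: 0.00, 1: 0.08, 2: 0.22, 3: 0.35, 4: 0.50, 5: 0.62, 6: 0.73,
       7: 0.84, 8: 0.84,  # 7 and 8 runs pooled into one bucket
       9: 0.91, 10: 0.94, 11: 0.97}


def use(runs):
    """League W% credited for scoring `runs`; 12+ runs is capped at 1.0."""
    return 1.0 if runs >= 12 else USE[runs]


def g_ow_pct(runs_scored):
    """Assumed form: mean 'use' coefficient over a team's runs scored."""
    return sum(use(r) for r in runs_scored) / len(runs_scored)


def g_dw_pct(runs_allowed):
    """Assumed form: mean 'invuse' (1 - use) over a team's runs allowed."""
    return sum(1 - use(r) for r in runs_allowed) / len(runs_allowed)


# Two made-up offenses that both average 4 R/G but bunch runs differently.
steady = [4, 4, 4, 4]
streaky = [0, 0, 8, 8]
steady_gow = g_ow_pct(steady)
streaky_gow = g_ow_pct(streaky)
```

With these placeholder coefficients, the steady offense comes out ahead of the streaky one despite the identical R/G, which is the sort of efficiency gap the comparison of gOW% to OW% is meant to surface.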

For most teams, gOW% and OW% are very similar. Teams whose gOW% is higher than OW% distributed their runs more efficiently (at least to the extent that the methodology captures reality); the reverse is true for teams with gOW% lower than OW%. The teams that had differences of +/- 2 wins between the two metrics were (all of these are the g-type less the regular estimate):

Positive: CHA, MIL, CHN, BAL, MIA, PIT, MIN
Negative: BOS, OAK, STL, TEX, CLE

There were an abnormally high number of teams this season whose gOW% diverged significantly from their standard OW%; as you’ll see in a moment, the opposite was true for gDW%. The White Sox gOW% of .467 was 3.5 games better than their OW% of .445. Their gOW% was seventh-lowest in the majors, but their OW% was second-worst. So while their offense was still bad, they wound up distributing their runs in a manner that should have resulted in more wins than one would expect from their R/G average.

As such, Chicago makes for an interesting case study in how a measly 3.69 runs/game can be doled out more efficiently. The black line is Chicago’s actual 2013 run distribution, the blue line is Enby’s estimate for a team averaging 3.691 R/G (r = 3.662, B = 1.018, z = .0853), and the red line is that of the majors as a whole (Chicago did not actually score more than twelve runs in a game this season, but fifteen is the standard I’ve always used in these graphs):

Chicago scored 3, 4, and 5 runs significantly more often than Enby would expect and more often than the major league average despite having a poor offense. 3-5 runs is a good spot to be in, at least in the current scoring environment--in 2013, teams won 54% of the time when scoring 3-5.

I deliberately wrote the preceding paragraph to be a little misleading--Chicago's propensity to score 3-5 runs was not really a positive, since it meant fewer games in which they scored more than five runs. The White Sox were shut out more often than the major league average (8% to 6.8%), scored < 2 runs more often than average (19.1% to 18%), but scored < 3 runs less often than average (50.6% to 47.8%). That is the only step at which Chicago was above average, and they quickly fell into well below average territory--Chicago scored < 6 runs 82% of the time versus the average of 71.9%:

Teams with differences of +/- 2 wins between gDW% and standard DW%:

Positive: SEA
Negative: ATL, TEX, OAK

The 3.7 win discrepancy between Atlanta’s gDW% (.570) and standard DW% (.592) was the largest such difference for any unit in the majors (greater than Chicago’s gOW% difference). The Braves were the only team which did not allow eleven or more runs in a game; the average was 3.4% and only Oakland (one) and St. Louis (two) had fewer than three such games. Avoiding those disaster games helped keep their RA/G low, but the Braves allowed four and five runs more often than both the Enby expectation for a team allowing 3.383 runs per game (r = 3.478, B = .983, z = .1023) would predict and the major league average:

Teams with differences of +/- 2 wins between gEW% and EW% (standard Pythagenpat):

Positive: SEA, CHA, PHI, PIT, MIN, CHN
Negative: OAK, TEX, STL, ATL, BOS, CLE, CIN

The negative list includes all playoff teams which obviously were not too badly hampered by seemingly inefficient run distributions. Standard Pythagenpat had a freakishly good year predicting actual W% in 2013, with an RMSE of 3.66 while gEW% had a 3.95 RMSE. gEW% does not incorporate any knowledge about the joint distribution of runs scored and allowed; if you do that, you may as well just look at actual win-loss record. But since it doesn’t have knowledge of the joint distribution, it’s quite possible for standard EW% to perform better as a predictor.
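For reference, an RMSE in wins can be computed along the lines below. This sketch assumes the errors are measured as W% differences scaled to a 162-game season; the function name and the sample inputs are hypothetical.

```python
import math


def rmse_wins(predicted_w_pct, actual_w_pct, games=162):
    """Root mean squared error between two W% lists, expressed in wins."""
    errors = [(p - a) * games for p, a in zip(predicted_w_pct, actual_w_pct)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Made-up example: two teams, each missed by exactly three wins.
predicted = [0.500, 0.500]
actual = [0.500 + 3 / 162, 0.500 - 3 / 162]
example_rmse = rmse_wins(predicted, actual)
```

Running the same function once with EW% and once with gEW% as the predictions, against actual W% for all thirty teams, would give the two figures being compared.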

For now, most of the applications of this methodology, at least in my writings, have been freak show in nature. The more interesting questions will be easier to investigate once I’ve finished my update of the Enby methodology. Do certain types of offenses tend to bunch their runs more efficiently? Can the estimate of variance of runs scored (which is really the key assumption underpinning Enby) be improved by considering team characteristics? How well do efficient or inefficient distributions by teams predict team performance in future years? I don’t mean to imply that others have not investigated these questions, simply that I hope to have more interesting material in these year-end reviews starting in 2014. I said that last year too though.