Tuesday, June 20, 2017

Enby Distribution, pt. 2: Revamping the Variance Estimate

All models are approximations of reality, but some are more useful than others. The notion of being able to estimate the runs per game distribution cleanly in one algorithm (rather than patching together runs per inning distributions or using simulators) is one that can be quite useful in estimating winning percentage or trying to distinguish between the effectiveness of team offense beyond similar noting their runs scored total. I’d argue that a runs per game distribution is a fundamentally useful tool in classical sabermetrics.

However, while such a model would be useful, Enby as currently constructed falls well short of being an ideal tool. There are a few major issues:

1) It is not mathematically feasible to solve directly for the parameters of a zero-modified negative binomial distribution, which forces me to use trial and error to estimate Enby coefficients. In doing so, the distribution is no longer able to exactly match the expected mean and variance--instead, I have chosen to match the mean precisely, and hope that the variance is not too badly distorted.

2) The variance that we should expect for runs per game at any given level of average R/G is itself unknown. I developed a simple formula to estimate variance based on some actual team data, but that formula is far from perfect and there’s no particular reason to expect it to perform well outside of the R/G range represented by the data from which it was developed.

3) An issue with run distribution models found by Tom Tango in the course of his research on runs per inning distribution is that the optimal fit for a single team’s distribution may not return optimal results in situations in which two teams are examined simultaneously (such as using the distribution to model winning percentage). One explanation for this phenomenon is the covariance between runs scored and runs allowed in a given game, due to either environmental or strategic causes.

I have recently attempted to improve the Enby distribution by focusing on these obvious flaws. Unfortunately, my findings were not as useful as I had hoped they would be, but I would argue (hope?) that they represent at least small progress in this endeavor.

During the course of writing the original series on this topic, I was made aware of work being done by Alan Jordan, who was developing a spreadsheet that used the Tango Distribution to estimate scoring distributions and winning percentage. One of the underpinnings was that he found (or found work by Darren Glass and Phillip Lowry that demonstrated) that the variance of runs scored per inning as predicted by the Tango Distribution could be calculated as follows (where RI = runs per inning and c is the Tango Distribution constant):

Variance (inning) = RI*(2/c + RI - 1) = RI^2 + (2/c - 1)*RI

Assuming independence of runs per inning (this is a necessary assumption to use the Tango Distribution to estimate runs per game), the variance of runs per game will simply be nine times the variance of runs per inning (assuming of course that there are precisely nine innings per game, as I did in estimating the z parameter of Enby from the Tango Distribution). If we attempt to simply this further by assuming that RI = RG/9, where RG = runs per game:

Variance (game) = 9*(RI^2 + (2/c - 1)*RI) = 9*((RG/9)^2 + (2/c - 1)*RG/9) = RG^2/9 + (2/c - 1)*RG

The traditional value of c used to estimate runs per inning for one team is .767, so if we substitute that for c, we wind up with:

Variance (game) =1.608*RG + .111*RG^2

When I worked on this problem previously, I did not have any theoretical basis for an estimator of variance as a function of RG, so I experimented with a few possibilities and found what appeared to be a workable correlation between mean RG and the ratio of variance to mean. I used linear regression on a set of actual team data (1981-1996) and wound up with an equation that could be written as:

Variance (game) = 1.43*RG + .1345*RG^2

Note the similarities between this equation and the equation based on the Tango Distribution - they both take the form of a quadratic equation less the constant (I purposefully avoided constants in developing my variance estimator so as to avoid unreasonable results at zero and near-zero RG). The coefficients are somewhat different, but the form of the equation is identical.

On one hand, this is wonderful for me, because it vindicates my intuition that this was a reasonable way to estimate variance. On the other hand, this is very disappointing, because I had hoped that Jordan’s insight would allow me to significantly improve the variance estimate. Instead, any gains to be had here are limited to improving the equation by using a more theoretical basis to estimate its coefficients, but there is no change in the form of this equation.

In fact, any revision to the estimator will reduce accuracy over the 1981-96 sample that I am using, since the linear regression already found optimal coefficients for this particular dataset. This by no means should be taken as a claim on my part that the regression-based equation should be used rather than the more theoretically-grounded Tango Distribution estimate, simply an observation that any improvement will not show up given the confines of the data I have at hand.

What about data from out of that set? I have easy access to the four seasons from 2009-2012. In these seasons, major league teams have averaged 4.401 runs per game and the variance of runs scored per game is 9.373. My equation estimates the variance should be 8.90, while the Tango-based formula estimates 9.23. In this case, we could get a near-precise match by using c = .757.

While we know how accurate each estimator is with respect to variance for this case, what happens when we put Enby to use to estimate the run distribution? The Enby parameters for 4.40 RG using my original equation are (B = 1.0218, r = 4.353, z = .0569). If we instead use the Tango estimated variance of 9.23, the parameters become (B = 1.0970, r = 4.041, z = .0569). With that, we can calculate the estimated frequencies of X runs scored using each estimator and compare to the empirical frequencies from 2009-2012:

Eyeballing this, the Tango-based formula is closer for one run, but exacerbates the recurring issue of over-estimating the likelihood of two or three runs. It makes up for this by providing a better estimate at four and five runs, but a worse estimate at six. After that the two are similar, although the Tango estimate provides for more probability in the tail of the distribution, which in this case is consistent with empirical results.

For now, I will move on to another topic, but I will eventually be coming back to this form of the Tango-based variance estimate, re-estimating the parameters for 3-7 RG, and providing an updated Enby calculator, as I do feel that there are distinct advantages to using the theoretical coefficients of the variance estimator rather than my empirical coefficients.