## Monday, July 09, 2012

### On Run Distributions, pt. 5: Estimating Variance

So far, I’ve demonstrated only that the zero-modified negative binomial distribution (which I’m referring to as the Enby distribution for the sake of my sanity) can provide a decent fit to actual scoring patterns when the sample variance of runs per game is known. For this approach to have value with teams for which we don’t know the variance, we need a way to approximate it. That is the whole point of the exercise: estimating what the distribution should be based on the average R/G rather than simply regurgitating a sample distribution. (As a tease, some work being done independently should allow for a better approximation of the variance than what I’ve come up with here. At some point in the future, I will incorporate that method into my methodology.)

Allow me to issue this disclaimer up front: the formula I’m proposing here is woefully inadequate. If the Enby distribution is to have any real value to sabermetricians, someone will have to come along and clean this part up (while I’m making a wish list, a better way of adjusting the parameters to return the correct average R/G after the zero modification would be nice too).

The scatterplot below shows the variance of R/G plotted against average R/G for each major league team, 1981-1996. You can see there is clearly a positive correlation between the two: the higher the average, the higher the variance.

[Scatterplot: variance of R/G vs. average R/G, major league teams, 1981-1996]

Beyond that, though, there is no clear pattern that will help me in attempting to develop a function to estimate variance from the mean. In fact, if you plot the ratio of variance to mean against the mean, you get a big clump.

[Scatterplot: ratio of variance to mean vs. average R/G]

There is a positive correlation (r = +.27) between the ratio of variance/mean and the mean. However, using a regression equation to describe the relationship between the mean and the variance introduces the problem of illogical results at the extremes.

For example, a linear regression yields the following equation for variance as a function of mean:

Variance = 2.637*mean - 2.670

For any team scoring less than 1.013 R/G, this formula will predict a negative variance, which is obviously impossible. Granted, I’m under no delusions that the final method I offer will be of any use outside of the normal scoring range of major league teams, but I cringe at laying out a method that obviously cannot work at such extremes.

Another option is to estimate the variance ratio as a function of the mean. The benefit here is that this constrains the estimated variance; if a team scores zero runs, the estimated variance will be zero. The estimated variance can never be negative:

Variance/mean = .1345*mean + 1.430
So Variance = mean*(.1345*mean + 1.430) = 1.430*mean + .1345*mean^2
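To make the difference between the two regressions concrete, here is a minimal Python sketch that evaluates both at a few scoring levels. The function names `variance_linear` and `variance_ratio` are mine, chosen for illustration; the coefficients are the ones quoted above.

```python
def variance_linear(mean):
    """Linear regression of variance on mean (can go negative)."""
    return 2.637 * mean - 2.670


def variance_ratio(mean):
    """Regression of variance/mean on mean, multiplied back through by mean.

    Because every term carries a factor of mean, the estimate is zero at
    zero runs and positive for any positive mean.
    """
    return mean * (0.1345 * mean + 1.430)


if __name__ == "__main__":
    # A few scoring levels, from the absurdly low to the absurdly high.
    for mean in (0.0, 0.5, 1.0, 1.5, 4.4558, 10.0):
        print(f"mean={mean:7.4f}  "
              f"linear={variance_linear(mean):8.3f}  "
              f"ratio={variance_ratio(mean):8.3f}")
```

The two estimates sit close together near the ordinary major league scoring range and diverge as the mean moves toward either extreme, which is where the choice of functional form starts to matter.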

Neither of these equations comes close to matching the aggregate results at 4.46 R/G, which is troublesome. But that value is itself an amalgamation of hundreds of individual team seasons, each recorded by teams that theoretically follow their own distributions of runs scored, and so I’m not sure a failure to match the result should be a death knell. Further complicating matters is that the estimate of variance will only be used to fit initial r and B parameters, with r then being varied to ensure that the mean of the distribution equals the actual mean.

Let me try using the second formula and run through the whole process to generate the expected run distribution for the aggregate 4.4558 R/G:

1. Estimate the variance of R/G from the mean:

Variance = 1.43*mean + .1345*mean^2 = 1.43*4.4558 + .1345*4.4558^2 = 9.042

2. Fit the parameters of an Enby distribution assuming no zero-modification:

B = variance/mean - 1 = 9.042/4.4558 - 1 = 1.029
r = mean/B = 4.4558/1.029 = 4.33

3. Estimate the parameter z (variable RI = (R/G)/9 = 4.4558/9 = .4951):

z = (RI/(RI + .767*RI^2))^9 = (.4951/(.4951 + .767*.4951^2))^9 = .0552

4. Calculate the probability for k runs (where k >= 1) using the zero-modified formula, then find a new value for r that sets the mean of the zero-modified distribution equal to the desired mean (4.4558).

This step can only be done via some kind of computer algorithm; I used trial and error and got r = 4.364.
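The four steps can be sketched in Python as follows. This is an illustration under stated assumptions rather than the exact procedure used here: it assumes the usual zero-modified form, P(0) = z and P(k) = (1 - z) * q(k) / (1 - q(0)) for k >= 1, where q is the unmodified negative binomial with q(0) = (1 + B)^(-r). Under that assumption the mean has the closed form (1 - z) * r * B / (1 - (1 + B)^(-r)), so r can be found by bisection instead of trial and error. All function names are hypothetical.

```python
def estimate_variance(mean):
    # Step 1: variance from the variance/mean regression.
    return 1.430 * mean + 0.1345 * mean ** 2


def initial_parameters(mean):
    # Step 2: negative binomial fit with no zero modification.
    B = estimate_variance(mean) / mean - 1
    r = mean / B
    return B, r


def estimate_z(mean):
    # Step 3: estimated shutout probability from runs per inning.
    ri = mean / 9
    return (ri / (ri + 0.767 * ri ** 2)) ** 9


def zero_modified_mean(r, B, z):
    # Mean of the zero-modified distribution (assumed form, see lead-in).
    return (1 - z) * r * B / (1 - (1 + B) ** (-r))


def solve_r(target_mean, B, z, lo=1e-6, hi=100.0, tol=1e-10):
    # Step 4: bisection on r; the mean is increasing in r for fixed B, z.
    for _ in range(200):
        mid = (lo + hi) / 2
        if zero_modified_mean(mid, B, z) < target_mean:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2


mean = 4.4558
B, r0 = initial_parameters(mean)
z = estimate_z(mean)
r = solve_r(mean, B, z)

if __name__ == "__main__":
    print(f"B = {B:.4f}, initial r = {r0:.4f}, z = {z:.4f}")
    print(f"adjusted r = {r:.4f}, "
          f"mean check = {zero_modified_mean(r, B, z):.4f}")
```

Because the value quoted in step 4 came from trial and error, and possibly from a slightly different handling of rounding, the bisection result need not match it to the last digit.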

For the first time in the series, we have a version of the Enby distribution that is not blatantly cheating compared to other methods: I am not treating the variance or the probability of being shut out as known values, but rather am estimating them. We’re still not flying completely solo--the formula for estimating variance from mean was based on the same data that’s been aggregated, but we’re getting closer.

What would the estimates look like if we tried to apply them to a really extreme team? I do not expect the Enby to perform well at all, but it’s worth checking to confirm. I’ll try to estimate the run distribution first for a team that averages 1.5 R/G, then for a team that averages 10. I didn’t select these numbers for any particular reason other than that they are extreme, without being so crazy that no one could possibly care (for practical rather than theoretical reasons) whether the method works there.

Obviously we don’t have an actual run distribution to compare to, but I’ll compare to the Tango-Ben distribution which, while also untested at these extremes, would be a better bet.

First, the 1.5 R/G team (parameters z = .3387, B = .6318, r = 2.5706):

[Figure: Enby and Tango-Ben distributions for the 1.5 R/G team]

And the 10 R/G team (parameters z = .0039, B = 1.775, r = 5.638):

[Figure: Enby and Tango-Ben distributions for the 10 R/G team]

I was honestly surprised when I saw how closely Enby tracked Tango-Ben. Pleasantly so, of course, but surprised nonetheless.
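For readers who want to generate the Enby probabilities for these two teams themselves, here is a minimal Python sketch using the parameters quoted above. It assumes the usual zero-modified form, with P(0) = z and the k >= 1 probabilities taken from the unmodified negative binomial and rescaled to sum to 1 - z. The names `enby_pmf` and `summarize` are hypothetical, and the Tango-Ben side of the comparison is not reproduced.

```python
import math


def enby_pmf(k, r, B, z):
    """P(k runs) under a zero-modified negative binomial (assumed form)."""
    if k == 0:
        return z
    # Log of the unmodified negative binomial probability q(k).
    log_q = (math.lgamma(r + k) - math.lgamma(r) - math.lgamma(k + 1)
             + k * math.log(B / (1 + B)) - r * math.log(1 + B))
    q0 = (1 + B) ** (-r)  # unmodified probability of zero runs
    return (1 - z) * math.exp(log_q) / (1 - q0)


def summarize(r, B, z, kmax=400):
    """Return (total probability, mean) summed over k = 0..kmax."""
    probs = [enby_pmf(k, r, B, z) for k in range(kmax + 1)]
    total = sum(probs)
    mean = sum(k * p for k, p in enumerate(probs))
    return total, mean


# Parameter sets as quoted in the text for the two extreme teams.
teams = {
    "1.5 R/G": dict(r=2.5706, B=0.6318, z=0.3387),
    "10 R/G": dict(r=5.638, B=1.775, z=0.0039),
}

if __name__ == "__main__":
    for label, params in teams.items():
        total, mean = summarize(**params)
        print(f"{label}: total={total:.6f}, mean={mean:.4f}")
        for k in range(6):
            print(f"  P({k}) = {enby_pmf(k, **params):.4f}")
```

Summing k * P(k) is a quick sanity check that a given set of parameters returns the intended average R/G.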

I’ve said this before in different ways, but it’s worth repeating: my experience working with probability distributions is either theoretical (I took a Stats course once where I rarely wrote a real number down for the entire class) or working with fixed distributions (e.g., you are given a Poisson distribution with parameter h = 2.05; what is the probability of three or fewer occurrences?). I have no practical knowledge of how to best fit a zero-modified distribution to sample data, and thus my work product here will be of little value. Hopefully I’ve provided enough promising results to encourage those of you who are skilled at this type of problem to consider the negative binomial as a model for runs per game.