## Monday, June 18, 2012

### On Run Distributions, pt. 3: Zero Modification

It appears that the negative binomial distribution provides a decent starting point for estimating the per game run distribution for typical major league teams. However, even a quick glance at the results suggests that it underestimates the probability of a team being shutout, and overestimates the probability of scoring a moderate number of runs (2-4).

While what I’m going to propose in this piece may appear to just be a crude fudge factor, but in fact distributions with modifications at zero are in use (Google “zero modified distributions” or something similar if you have to see for yourself). Modifying the negative binomial distribution to allow more probability to be placed at the point 0 runs will obviously improve our estimate of that probability, but should also result in more accurate estimates across the board.

I’ll let the variable a represent the probability of zero using the negative binomial distribution without modification, while z will represent the new probability that we wish to place on the zero point (in effect, it is now a third parameter along with r and B). So:

a = (1 + B)^(-r)

To calculate the probability of j runs (where j >=1) in this new distribution, all we do is assume that the share of non-zero probability assigned to each j stays the same--the only difference is that there is now a different amount of non-zero probability available. Recall that:

q(k) = (r)(r + 1)(r + 2)(r + 3)...(r + k - 1)*B^k/(k!*(1 + B)^(r + k)) for k >=1

So:

p(k) = z for k = 0
p(k) = (1 - z)*q(k)/(1 - a) for k >=1

In order to demonstrate this, I’m going to look at the three groupings of 1981-1996 teams I used last time: all teams and the 25 lowest and highest scoring teams. In this case I will set z equal to the actual observed frequency of scoring zero runs. Obviously, this is cheating if we are trying to establish the accuracy of the model and not helpful in cases in which we don’t know the empirical probability of zero runs. Again, though, I’m doing it this way as a quick check of whether or not we’re wasting our time. If you can’t make material gains in the accuracy with a known value, good luck trying to do so while estimating that value.

For the average teams, z = .0578 and a was previously calculated to be .0495. We previously estimated that the probability of scoring 3 runs was .1468, so the new estimated probability of scoring 3 runs is:

p(3) = (1 - .0578)*.1468/(1 - .0495) = .1455

Here’s a chart:

The model is still guessing too high on the probability of scoring a moderate number of runs, but at least now we no longer have one point which is an obvious problem. Going back to the X^2 statistic I discussed previously, the X^2 (excluding zero runs, since we’ve set this up in such a manner that the error at the point is zero by definition) for the zero-modified model is 63 compared to 68 for the unmodified distribution. That’s a slight gain in accuracy; in order for it to be worth the extra effort, the gain in better estimating shutouts has to be worth the extra work. For the lowest and highest scoring teams, the gains are similar.

Unfortunately, the zero-modification has a side effect which reduces the accuracy and (more importantly) destroys a fundamental property that our run distribution model must adhere to. I’d posit there are two such fundamental properties--obviously there are other aspects to consider, but these two are non-negotiable:

1. The model must return a total probability of 1. That is, p(0) + p(1) + p(2) + … + p (infinity) = 1. This property is not affected by the zero-modification.

2. The model must return the correct average runs scored. That is, 0*p(0) + 1*p(1) + 2*p(2) + 3*p(3) + ... = R/G. This property is not preserved by the zero-modified distribution.

Remember that when I fit the parameters initially, I used the method of moments, which relies on the mean of the distribution. The mean of a negative binomial distribution is r*B, and by using the method of moments I ensured that the mean and variance of the fitted distribution would equal the mean and variance of the sample.

In the literature on zero-modified distributions (see here for an example) one can find formulas for the mean and variance of the modified distributions. Ideally, we could just use the method of moments in conjunction with these formulas and fit the distributions properly. Unfortunately, that’s not possible:

mean = (1 - z)*r*B/(1 - (1 + B)^(-r))
variance = (1 - z)*rB*((1 + B) - (1 + B + r*B)*(1 + B)^(-r))/((1 - (1 + B)^(-r))^2 + z*(1 - z)*((1 - z)*r*B/(1 - (1 + B)^(-r)))^2

In fact, even the formula for the mean makes it impossible to algebraically force the mean of the modified distribution to equal the sample mean. Even if you hold either r or B constant, there’s no simple algebraic solution.

This complicates the decision between the modified and standard distribution considerably; if we use the standard distribution, we know that we’ll underestimate the probability of a shutout for all manner of teams. If we use the modified distribution, we won’t retain the same mean and variance as observed in the sample. What to do?

From my limited testing, it appears as if the least disruptive approach is to set the value of r such that the mean is equal to the sample mean. By least disruptive, I mean that the variance of the modeled distribution is closer to the variance of the sample when r is changed rather than B. For the sample of all teams, the revised value of r necessary is approximately 4.09, which results in a variance of 9.5663 (the sample variance is 9.3603). Changing B would result in a new B value of approximately 1.112 with a variance of 9.61.

Ideally both r and B would be selected in such a way that would maximize accuracy, but that is beyond my capabilities given the complexity of the model and three parameters in play. So I’ll have to settle for a crude solution, which will be to vary r.

Since there is no algebraic solution to calculate the necessary value of r, this will require an ad hoc approach. For those that are interested, I have gone through the trouble of figuring the parameters for common R/G levels and will present those at the end of this series.

Here is a graph of the modified distribution:

We’re still underestimating the likelihood of 2-4 runs and overestimating the likelihood of 5-7 runs, but I’m unaware of any other discrete distribution that comes close to this level of accuracy.

I’m getting a little tired of writing "negative binomial" and "modified", and so I will slap a cutesy name on this method. I’m not going to name it after myself, of course (for the record, I didn’t push the use of “Pythagenpat”, and this is only a new application of a common probability distribution discovered centuries ago). Until someone else decides they want to use this method and give it a better name, I’ll go with the Enby Distribution. Abbreviate “negative binomial” to “NB” and try spelling it as a word, and you get Enby.

1. If the c value of the Tango distribution were 1, then RI would be geometrically distributed (NB distributed with r=1). It would really be easy to calculate RG, just use NB with r=9 (the sum of geometric or NB with the same B are NB with r=sum of r's). Unfortunately that doesn't work with zero inflated NB's. I don't know why r seems to be around 4 for the RG distribution.

It would be nice to find a simple easy to work with continuous approximation to the NB that would allow analytic derivatives so a win estimator could be derived.

2. Yes, it would be wonderful if R/I followed the geometric distribution, as then runs in any N innings would just be a negative binomial with r = N. For whatever reason, in reality the R/I and R/G distributions don't easily relate, which I think is one of the reasons why there's been relatively so little published on estimating it.

With respect to a continuous approximation, there are some articles out there on using the Weibull distribution and a paper by Steven Miller on using Weibull to approximate the Pythagorean relationship. I believe I linked a couple in the first installment of this series if you are interested.

3. I've read that paper, in BTN I believe. The problem with Weibull is that it leads to the pyth and we can already do better than the pyth.

I believe the track to follow is BVL's algorithm:
F(n)=dr^n * f(0)^9 * q

where q is sum i=1 to min(i,9)of (9 i)*(n-1 i-1)*(c/f(0))^i

i=1 gets you a constant (power=0)
i=2 gets you a line (power=1)
i=3 gets you a quad (power=2)
...
i=9 gets you (power=8)

The sum (q) is an 8th degree polynomial.

It's a hairy function, but it seems continuous which would take us out of the realm of discrete math and into calculus.

(9 i)*(n-1 i-1) is the same regardless of RI and c.

4. What method do you have in mind when you say we can already do better than Pythagorean? I'm not aware of anything more accurate than Pythagorean with a custom exponent. The value of the exponent in Miller's derivation is not fixed, but rather one of the parameters that defines the shape of his Weibull distribution (if my recollection is correct--it's been a while since I read it and the math was challenging for me anyway).

5. My understanding is that they are fixed for a league in a given time period. If your exponent is 1.74 (Max Like) then it's 1.74 for the league in a given time period (AL 2004?). If it's 1.79 (Least Sqs) then it's 1.79 for the whole league.

He hasn't derived the pythagenpat or pythagenport. He's simply derived Bill James' fixed exponents with exponents resulting from means and variances. Deriving the pythagenpat or pythagenpor would require a distribution (at least somewhat) different from the Weibull.

I think the convolutions of the Tango are a road to an estimator. I just don't have the skills to do it.

6. I agree completely that he hasn't derived a way to calculate custom exponent. However, the fact that the parameter varies by context indicates that it is picking up on a similar phenomenon to what drives custom exponents.

Or to put it another way, I don't see any reason to believe that the Tango-based method of estimating R/G distribution is any more accurate than Weibull (perhaps it is, but that has not yet been demonstrated). And since there's no real evidence that it's a better fit to the R/G distribution, I don't see any reason to believe that it would produce a better estimator of W%. Which is certainly not to say that one shouldn't try and see if it in fact does.

7. I can agree with everything in that statement.

Something I overlooked in Miller's article. He's using the three parameter Weibull. I just assumed he was using the regular Weibull. When people use the term Weibull, they usually mean the two parameter Weibull from survival analysis. Fitting the 3 par to data is problematic as sometimes max like doesn't exist in closed form. I'm not even sure how to fit the 3 par W to data.

Comments are moderated, so there will be a lag between your post and it actually appearing. I reserve the right to reject any comment for any reason.