Comments on Walk Like a Sabermetrician: On Run Distributions, pt. 2: Negative Binomial

Eli, I don't have the requisite knowledge to ...

2015-11-04T19:31:49.597-05:00

Eli,

I don't have the requisite knowledge to answer the question with statistical rigor, but what I can tell you (some of which I posted in parts 3-7 of this series and some of which I need to write up still) is:

1. When a correction for P(0 runs) is applied, the model appears to be reasonably accurate at estimating the distributions for major league teams with normal levels of run scoring.

2. The results are consistent with those you would get from using Tango's runs per inning distribution (Tango Distribution) to estimate a runs per game distribution.

3. The estimates of W% that one can calculate by estimating a team's distributions of runs scored and runs allowed (i.e. P(win) = P(score >0)*P(allow 0) + P(score >1)*P(allow 1) + ...) are very similar to those found using other run/win converters (such as Pythagorean).
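The calculation in point 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the negative binomial is fit by mean and variance (method of moments), no correction for P(0 runs) is applied, runs scored and allowed are treated as independent, and the means and variances below are made-up numbers rather than real team data. Dividing out ties to get a W% is also my assumption, not something taken from the series.

```python
import math

def nb_pmf(k, mean, var):
    """Negative binomial pmf parameterized by mean and variance (needs var > mean)."""
    p = mean / var                   # success probability in the (r, p) form
    r = mean * mean / (var - mean)   # shape parameter, may be non-integer
    log_coef = math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
    return math.exp(log_coef + r * math.log(p) + k * math.log(1 - p))

def run_distribution(mean, var, max_runs=40):
    """P(0 runs), P(1 run), ..., truncated at max_runs and renormalized."""
    probs = [nb_pmf(k, mean, var) for k in range(max_runs + 1)]
    total = sum(probs)
    return [q / total for q in probs]

def win_probability(scored, allowed):
    """Return (P(score > allow), P(score == allow)) for independent distributions."""
    p_win = 0.0
    p_tie = 0.0
    for k, p_allow in enumerate(allowed):
        p_win += p_allow * sum(scored[k + 1:])   # P(allow k) * P(score > k)
        if k < len(scored):
            p_tie += p_allow * scored[k]
    return p_win, p_tie

# Hypothetical team: 4.5 runs scored and 4.2 runs allowed per game.
scored = run_distribution(mean=4.5, var=9.5)
allowed = run_distribution(mean=4.2, var=9.0)
p_win, p_tie = win_probability(scored, allowed)

# Assumption: tied games are split in proportion to the non-tied outcomes.
w_pct = p_win / (1.0 - p_tie)
print(f"P(win)={p_win:.3f}  P(tie)={p_tie:.3f}  est. W%={w_pct:.3f}")
```

The sum in `win_probability` is the same series written out in point 3, one term per value of runs allowed.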

The classic definition of the negative binomial se...

2015-11-04T19:10:17.226-05:00

The classic definition of the negative binomial seems like it corresponds to the question "how many times does the batting team get on base before accumulating three outs?" That differs notably from runs scored: essentially only the fourth and later of these successes score.

Now I'm surprised the fit is as good as it is. This might explain why the model doesn't have enough shutouts, but it seems like it would overdo it.

Does any of this work out under more rigor?
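To make the "times on base before three outs" reading concrete, here is a rough Python sketch. It assumes every plate appearance is independent with the same on-base probability, and it takes the crude view from above that only the fourth and later baserunners score, so it ignores home runs, double plays, runners thrown out, and so on. The on-base value is hypothetical, and note that this describes a single inning, not a game.

```python
import math

def reach_base_pmf(k, obp, outs=3):
    """P(exactly k batters reach base before the `outs`-th out is made)."""
    return math.comb(k + outs - 1, k) * obp**k * (1 - obp)**outs

def inning_summary(obp, outs=3, max_k=60):
    """Mean baserunners per inning and P(at most 3 reach), truncated at max_k."""
    pmf = [reach_base_pmf(k, obp, outs) for k in range(max_k + 1)]
    mean_reached = sum(k * q for k, q in enumerate(pmf))
    # Crude assumption: runs = max(0, baserunners - 3), so a scoreless
    # inning is one where three or fewer batters reach base.
    p_scoreless = sum(pmf[:4])
    return mean_reached, p_scoreless

obp = 0.320  # hypothetical on-base probability
mean_reached, p_scoreless = inning_summary(obp)
print(f"mean baserunners per inning: {mean_reached:.3f}")
print(f"P(three or fewer reach base): {p_scoreless:.3f}")
```

The mean here should match the textbook value outs * obp / (1 - obp), which is a quick sanity check on the pmf.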

The negative binomial distribution gets in statist...

2012-06-17T21:56:31.947-04:00

The negative binomial distribution gets used in statistics in negative binomial regression. It's a type of generalized linear model for count data where you have y=exp(mx + b). In NB regression the response is assumed to be NB distributed. Originally this kind of model would have been estimated with a normal distribution. The problem with the normal distribution is that the variance is assumed constant. In count data, large predicted values are expected to have larger variances. Poisson regression was a big advance for count data because it allowed larger predicted values to have larger variances. With a Poisson distribution, the variance is equal to the mean. In many data sets the variance is greater than the mean (very rarely it's smaller). Variance that is much larger than the mean is called overdispersion, and this is where NB regression comes in. NB regression allows the variance to grow quadratically with the mean. There are other variance models, including quasi-Poisson, which lets the variance increase linearly with the predicted value.
The reason the variance is so important is that if you get it wrong, then the standard errors of the coefficients, F values, t values, and chi-squares can all be wrong.
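As a rough illustration of the three variance models, here is a small Python sketch. The parameter names `phi` and `alpha` and the sample counts are my own placeholders, and the NB variance is written in the common mean-plus-quadratic form, which is one of several parameterizations.

```python
import math
import statistics

def predicted_mean(x, m, b):
    """Log-link mean used by these count models: y = exp(m*x + b)."""
    return math.exp(m * x + b)

def poisson_variance(mu):
    return mu                        # variance equals the mean

def quasi_poisson_variance(mu, phi):
    return phi * mu                  # linear in the mean, phi > 1 is overdispersion

def negative_binomial_variance(mu, alpha):
    return mu + alpha * mu ** 2      # quadratic in the mean

def dispersion_ratio(counts):
    """Sample variance over sample mean; values well above 1 suggest overdispersion."""
    return statistics.variance(counts) / statistics.mean(counts)

# Hypothetical per-game run totals, purely for illustration.
counts = [0, 2, 3, 5, 1, 7, 4, 10, 2, 6]
print(f"variance/mean = {dispersion_ratio(counts):.2f}")
for mu in (2.0, 4.0, 8.0):
    print(mu, poisson_variance(mu),
          quasi_poisson_variance(mu, phi=2.0),
          negative_binomial_variance(mu, alpha=0.25))
```

Printing the three variances side by side for a few means shows how quickly the quadratic term takes over as the predicted value grows.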

Alan