## Monday, June 23, 2008

### Run Estimation Stuff, pt. 2

In this installment I will derive Base Runs equations from the 1960-2004 linear weights (built on Tom Ruane’s work) introduced in the last segment. I will present four different BsR versions here, which is admittedly a bit of overkill. Each follows a slightly different approach to the BsR model. First, we need to recall that BsR is A*B/(B + C) + D. The A factor is baserunners, the B factor is advancement, the C factor is outs, and the D factor is guaranteed runs.
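As a minimal sketch, the general form can be written as a small Python function. The function and argument names are my own for illustration; only the formula A*B/(B + C) + D comes from the text.

```python
def base_runs(a: float, b: float, c: float, d: float) -> float:
    """Base Runs: baserunners times the score rate B/(B + C), plus guaranteed runs.

    a: baserunners, b: advancement, c: outs, d: guaranteed runs.
    """
    score_rate = b / (b + c)  # share of baserunners estimated to score
    return a * score_rate + d
```

Read this way, B/(B + C) is the estimated fraction of baserunners who come around to score, which is why B has to be balanced against the definitions chosen for A, C, and D.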

There are a number of different ways to define each factor. Starting with A, we could look at A as “initial baserunners” or “final baserunners”. Initial baserunners would include all runners who reached base, or, since we are only dealing with the official categories here, H + W - HR + HB. This counts all of the runners that we know have reached base, and ignores any knowledge that we have about what happened once they were on base.

The final baserunner approach starts from the same count, but then removes the baserunners we know were put out on base. There are two categories in the official stats that tell us a runner was retired on base: CS and DP (grounded into double play, but I have referred to them simply as DP throughout the series). So final baserunners would be figured as H + W - HR + HB - CS - DP.

Now we can move on to C, a factor that is usually defined as batting outs, which in this case would be AB - H + SH + SF. It could also be set up as all outs, which would be AB - H + SH + SF + CS + DP. Then you could mix and match the two versions of C with the two versions of A. So there will be one version that is initial baserunners with batting outs, one that is initial baserunners with all outs, etc. Again, this most definitely is overkill, but I think it will be interesting to take a look at the results either way, and you can choose whichever one you want (most people who have gone before have used the initial baserunner, batting out approach, and if I had to recommend just one, I would agree with that choice).

D as guaranteed runs is usually glossed over as simply being equal to home runs, but that is not inevitable. It is true, for instance, that SF are a case in the official stats for which we know for a fact that a run was scored, and SF could be included in guaranteed runs. This approach could eventually lead to some absurdities; if the record-keeping was as detailed for all events as it is for SF, we could eventually have categories like “1-RBI singles”, “2-RBI doubles”, “1-RBI triples”, and we would end up with an equation that said Runs = Runs.

Taking the opposition to SF in the D factor a step further, the general argument can also be used against the inclusion of a situational event like a SF or DP at all. After all, sacrifice flies are simply a subcategory of flyouts, and double plays of groundouts. The record keepers have not seen fit to give us such minute breakdowns of other categories. As Tango Tiger has pointed out, events like caught stealing include cases in which there is actually no out recorded, and batting outs include reached on error, etc.

However, given that the data does exist, there are some people who want to utilize it. Also, the technical versions of Runs Created, which ideally would be supplanted by Base Runs, use this data, and so those users would presumably want a replacement equation based on the same inputs. For those who prefer the more granular approach, Tango Tiger’s full BsR version has already done the heavy lifting for you.

In addition to the potential for using SF as a D input, one could also go through and add fractional values for guaranteed runs on other events. For example, there will be some proportion of triples that result in runs due to an error that allows the batter to score. One could have these fractional categories in A, C, or D factors. For example, Tango Tiger’s full BsR version counts 8% of SH towards A, as around 8% of batters credited with a sacrifice reach base safely. Another example is that one could put a fractional weight on CS in C, because not all CS result in outs. Leaving aside the question of how an estimate of “X% of triples result in runs” fits into a factor I’ve billed as “guaranteed runs” (obviously the words used to define the category can be finessed), I will just sidestep the whole issue by saying I have not dealt with fractional weights anywhere (except of course in B). That doesn’t mean that it would be illegitimate to do so.

Finally, we come back around to B, which I glossed over the first time. B is usually considered to be the nebulous “advancement”, but it can also be looked at as the balancing part of the formula, where the values of events are forced into line with what we know them to be for an average team. Since we have already established the formulas and values for A, C, and D, we can calculate the B factor necessary to force the linear weights to the long-term averages derived in the last post, and repeated here:

LW = .460S + .756D + 1.037T + 1.405HR + .304(W - IW) + .174IW + .329HB + .192SB - .260CS - .068(AB - H - DP - K) - .107K - .459DP + .071SH + .154SF

I have described the process for doing this several times, and will not further clutter this page by doing it again, but here is a link to the BsR page in Tango Tiger’s wiki, where it is covered. Now, let me define four different BsR formulas (with their naming style in homage to Bill James) that we are going to look at, and the component they use (A, B, C, D):
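For readers who want the gist without following the link, here is a rough sketch of one way to back out a B coefficient by taking the partial derivative of A*B/(B + C) + D with respect to an event. It is my own illustration and may differ in detail from the process described on the wiki; the function names and the totals in the usage lines are hypothetical.

```python
def total_b(runs, a_tot, c_tot, d_tot):
    """Total B that makes A*B/(B + C) + D equal the actual runs scored."""
    scored = runs - d_tot  # runs that must come from baserunners
    return scored * c_tot / (a_tot - scored)


def marginal_value(a, b, c, d, a_tot, b_tot, c_tot):
    """Partial derivative of BsR for an event adding (a, b, c, d) to each factor."""
    denom = b_tot + c_tot
    return ((a * b_tot + a_tot * b) * denom - a_tot * b_tot * (b + c)) / denom**2 + d


def solve_b_coefficient(lw, a, c, d, a_tot, b_tot, c_tot):
    """B coefficient that forces the event's marginal value to equal lw."""
    denom = b_tot + c_tot
    return ((lw - d) * denom**2 - a * b_tot * denom + a_tot * b_tot * c) / (a_tot * c_tot)


# Hypothetical totals, NOT the 1960-2004 data.
A_TOT, C_TOT, D_TOT, RUNS = 1790.0, 4185.0, 150.0, 750.0
B_TOT = total_b(RUNS, A_TOT, C_TOT, D_TOT)
# A single adds 1 to A and nothing to C or D; its target weight is .460.
single_b = solve_b_coefficient(0.460, 1, 0, 0, A_TOT, B_TOT, C_TOT)
```

Feeding `single_b` back into `marginal_value` returns the .460 target, which is the sense in which B "balances" the formula.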

Full-1: iA, B, bC, HR

Full-2: iA, B, aC, HR

Full-3: fA, B, bC, HR

Full-4: fA, B, aC, HR

Where:

iA = H + W - HR + HB (initial baserunners)

fA = H + W - HR + HB - CS - DP (final baserunners)

bC = AB - H + SH + SF (batting outs)

aC = AB - H + SH + SF + CS + DP (all outs)
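The four definitions translate directly into code. This is only a sketch: the dictionary keys mirror the abbreviations used above, and the stat line in the usage lines is made up for illustration.

```python
def a_and_c_factors(s):
    """Return the two A definitions and two C definitions for a stat line.

    s is a dict with keys H, W, HR, HB, CS, DP, AB, SH, SF.
    """
    on_base_out = s["CS"] + s["DP"]            # runners known to be retired on base
    i_a = s["H"] + s["W"] - s["HR"] + s["HB"]  # initial baserunners
    f_a = i_a - on_base_out                    # final baserunners
    b_c = s["AB"] - s["H"] + s["SH"] + s["SF"] # batting outs
    a_c = b_c + on_base_out                    # all outs
    return {"iA": i_a, "fA": f_a, "bC": b_c, "aC": a_c}


# Hypothetical team season, for illustration only.
line = {"H": 1400, "W": 500, "HR": 150, "HB": 40, "CS": 50,
        "DP": 120, "AB": 5500, "SH": 40, "SF": 45}
factors = a_and_c_factors(line)
```

Full-1 pairs `iA` with `bC`, Full-2 pairs `iA` with `aC`, Full-3 pairs `fA` with `bC`, and Full-4 pairs `fA` with `aC`.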

Writing out the formulas for each of the resulting B factors would be very cumbersome, so I have put them in chart form:

You may notice some oddities in the B weights as you peruse the table. Most problematic is that the walk coefficient is negative. This obviously will cause a whole bunch of problems for theoretical situations with extreme walk rates. Thus, I have also presented a “corrected” version (in the style of Full-1, with initial baserunners and batting outs as the definitions for the A and C factors respectively; it is “F-1W” in the table) in which the walk is given a coefficient of .025--I chose this number haphazardly, and you could very well improve it. However, my primary objective here is just to clean up the obvious problems caused by assigning a negative advancement value to the walk. This requires some juggling of the other B coefficients, and the linear weights that it generates when applied to our long-term stats will no longer be the same as the target linear weights (the modified Ruane weights). This is unfortunate, but it also is necessary in this case to avoid having a negative weight for the walk:

A = H + W - HR + HB

B = .719S + 2.098D + 3.408T + 1.887HR + .025(W - IW) - .613IW + .109HB + .895SB - 1.211CS + .121(AB - H - K - DP) - .061K - 1.701DP + .769SH + 1.155SF

C = AB - H + SH + SF
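A sketch of the corrected formula in Python follows. The coefficients are copied from the B equation above and D is home runs, as in all four Full versions. The dictionary keys (including "UBB" and "OUT"), the function names, and the choice to rebuild H from its components are my own conventions.

```python
# B coefficients of the corrected "F-1W" version, copied from the equation above.
F1W_B = {
    "S": 0.719, "D": 2.098, "T": 3.408, "HR": 1.887,
    "UBB": 0.025,  # unintentional walks, W - IW
    "IW": -0.613, "HB": 0.109, "SB": 0.895, "CS": -1.211,
    "OUT": 0.121,  # AB - H - K - DP
    "K": -0.061, "DP": -1.701, "SH": 0.769, "SF": 1.155,
}


def f1w_factors(s):
    """Return (A, B, C, D) for a stat line; missing keys count as zero."""
    g = lambda k: s.get(k, 0)
    hits = g("S") + g("D") + g("T") + g("HR")
    counts = {k: g(k) for k in F1W_B}
    counts["UBB"] = g("W") - g("IW")
    counts["OUT"] = g("AB") - hits - g("K") - g("DP")
    a = hits + g("W") - g("HR") + g("HB")          # initial baserunners
    b = sum(F1W_B[k] * counts[k] for k in F1W_B)   # advancement
    c = g("AB") - hits + g("SH") + g("SF")         # batting outs
    d = g("HR")                                    # guaranteed runs
    return a, b, c, d


def f1w_runs(s):
    """Estimated runs, A*B/(B + C) + D, for a stat line."""
    a, b, c, d = f1w_factors(s)
    return a * b / (b + c) + d
```

Note that DP appears only in B here, with its own weight, and not in C, which is the batting-outs definition.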

Here is a comparison of the target weights (“Ruane”), the weights generated by this equation (“Result”), and the difference (Result-Ruane):

Thus, we still have a pretty decent match for the target weights, with the walk not surprisingly as the biggest source of error.

Let me close this out by also presenting a formula that only looks at the basic events, and matches these weights that we derived last time:

LW = .473S + .769D + 1.050T + 1.418HR + .304W + .192SB - .265CS - .088(AB - H)

I have two versions here; one is with initial baserunners (which I will call B1, where iA is H + W - HR) and one with final baserunners (which I will call B2, where fA is H + W - HR - CS):

B1 = .764S + 2.169D + 3.503T + 1.985HR - .039W + .912SB - 1.258CS + .036(AB - H)

B2 = .762S + 2.247D + 3.658T + 2.098HR - .087W + .964SB + .283CS + .032(AB - H)
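The A and B factors of the two basic versions can be sketched as below. The post does not restate C and D for these versions, so the sketch stops at A and B rather than guessing; the names are illustrative.

```python
# B coefficients of the basic versions, copied from the two equations above.
B1_COEF = {"S": 0.764, "D": 2.169, "T": 3.503, "HR": 1.985,
           "W": -0.039, "SB": 0.912, "CS": -1.258, "OUT": 0.036}
B2_COEF = {"S": 0.762, "D": 2.247, "T": 3.658, "HR": 2.098,
           "W": -0.087, "SB": 0.964, "CS": 0.283, "OUT": 0.032}


def basic_a_and_b(s, final_baserunners=False):
    """Return (A, B) for the B1 (initial) or B2 (final) basic version."""
    g = lambda k: s.get(k, 0)
    hits = g("S") + g("D") + g("T") + g("HR")
    counts = {k: g(k) for k in B1_COEF}
    counts["OUT"] = g("AB") - hits  # the (AB - H) term
    coef = B2_COEF if final_baserunners else B1_COEF
    a = hits + g("W") - g("HR") - (g("CS") if final_baserunners else 0)
    b = sum(coef[k] * counts[k] for k in coef)
    return a, b
```

The CS coefficient changes sign between the two versions because B2 already removes the caught runner from A.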

Finally, I apologize for any formatting errors in this post. Blogger seems to have changed its text editor, and when I copy and paste from Word, it leaves just one space after a period. This annoys me as a reader, but it is a real pain in the burro to go back and add all of the necessary spaces.

EDIT: Well isn't this lovely? Now it messed up the formatting for the entire front page. If Blogger can't make their editor workable for people who aren't HTML experts, then I'm going to have to go elsewhere. I'm a poor enough writer as it is; I don't need to have a blogging platform make my posts look like they were formatted by a future Michigan alum trying to pass kindergarten on his third attempt.

1. Great stuff. Any ideas as to why the walk coefficient flipped into the negative? I agree that it's probably correct to force it to be a positive number. Just seems surprising to me that this would happen.

Also, FWIW, the single space thing is the default HTML convention (and European style, of course), and it takes a special HTML code to get a second space after a period. No idea why Blogger isn't doing that for you, though. -j

2. Thanks, I did not know that was a default conversion.

Re: the walk coefficient, the first potential explanation as always is random variation with this particular sample (although 1960-2004 is pretty large, so it shouldn't be an issue). But the walk coefficient is always low to begin with, so it moving by .05 can take it into negative territory, whereas if I had gotten a coefficient of 1.9 for a double and some previous formula had 1.85 I probably wouldn't have even noticed.

The missing data probably doesn't help either. Tango's full version accounts for just about everything that could possibly happen...balks, wild pitches, runners left on base in innings that aren't played to completion...and he gets .05 or so for the walk.

Then there is Tango's explanation, which is true and could act in conjunction with 1 or 2 as well. (It is that BsR assumes that all baserunners will score with the same frequency, but of course this is not true...and you can go there to read the rest).

The good news is that the .025 walk rate version only overrates the walk by .007 runs, and is only a couple hundredths of a run worse in RMSE.

3. I took a look at the F1-W version of the formula, and to get everything to balance, you have to use a C factor of AB-H-DP+SH+SF. For the F1-W version, you stated you were using AB-H+SH+SF for the C factor. Am I missing something here? Since you used AB-H-K-DP in B, you would have to use DP in C, or the linear weight value of a DP would be understated.

4. The F1-W formula and the F1 formula are both based on:

A = initial baserunners = H + W - HR + HB

C = batting outs = AB - H + SH + SF

Double plays are accounted for in the B factor of F1-W; they have their own weight of -1.7.

In a version like F2, which is based on initial baserunners and all outs in C (AB - H + DP + SH + SF + CS), the DP has a lower (absolute) B weight (-1.36). The balancing occurs for all of the events in the B factor.

If you put in the actual event frequencies for the 1960-2004 period, the F1-W formula predicts 787,363 runs, the exact number that was scored. That is not an accuracy claim, since the formula is designed to do just that, but it does show that it balances.

Perhaps the confusion is from the fact that GIDPs are already counted in AB-H. Adding them in again counts both outs--the out recorded on the baserunner and the batter being retired.

I also should probably use "GDP" instead of "DP", since DP implies all double plays, but since I said upfront I am only dealing with the official offensive categories and the only such category is GIDP, I find the extra letters unnecessary, just like the P in "HBP".

Hopefully that clears this up despite its rambling nature :-)

5. Is it OK to adjust the out values in the B factor to force your equation to fit a specific dataset? For instance, if I calculate a multiplier of 1.03, instead of multiplying everything in B by 1.03, could I just adjust the out value to bring it closer to 1?

6. That is an excellent question. Allow me the indulgence of answering by way of a digression.

One of the problems with just multiplying the B factor by a scalar (like 1.03) is the effect that has on the negative events. The 1.03 multiplier means that all of the weights will be increased, which will make the outs more costly, even though we are already UNDERestimating runs scored.

So an alternative you could try is this: find the B shortfall for your input data (in other words, B'-B instead of B'/B). Then, distribute this evenly over all events. You could do (B'-B)/PA, and if that was .004 then you would add .004*PA to the B factor. If you want to get SB and CS in there, you could use (PA + SB + CS) as the denominator instead; hopefully the error won't be large enough that it will make much of a difference either way.

I would not recommend applying the change only to the out coefficients. But you could try the addition method rather than the multiplication method, and for all I know it might work better (I have not tested it).

7. Would you use this (B'-B)/PA method for batting outs and strikeouts?

8. You would add it to each event that included a PA (which would include batting outs and Ks). Or you could leave it in the B factor as x*PA, which would be mathematically equivalent.

B = .7S + 2D + 3T + 2HR + .01W + .1(AB-H-K) + .02K

Suppose that the needed B was 1000, the actual B was 970, and there were 3000 PA. (1000-970)/3000 = .01. So we would have the new B factor:

.71S + 2.01D + 3.01T + 2.01HR + .02W + .11(AB-H-K) + .03K
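The same arithmetic as a short Python sketch. The function name and the dictionary keys are my own; "OUT" stands for (AB-H-K).

```python
def additive_adjust(coeffs, needed_b, actual_b, pa, pa_events):
    """Spread the B shortfall evenly over every event that uses up a PA."""
    per_pa = (needed_b - actual_b) / pa  # shortfall per plate appearance
    return {event: weight + per_pa if event in pa_events else weight
            for event, weight in coeffs.items()}


# The example B factor from this comment.
b_coeffs = {"S": 0.7, "D": 2.0, "T": 3.0, "HR": 2.0,
            "W": 0.01, "OUT": 0.1, "K": 0.02}
adjusted = additive_adjust(b_coeffs, needed_b=1000, actual_b=970, pa=3000,
                           pa_events=set(b_coeffs))
```

An event that does not consume a PA, such as SB or CS, would simply be left out of `pa_events` and keep its original weight.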

9. This comment is from terpsfan101...don't worry, I'm not writing questions to myself :)

Patriot,

I was bored so I fooled around with your Base Runs equations. For some reason I couldn't get Ruane's LW to reconcile to your values. I added the shortfall per PA like you did.

I'm pretty sure Ruane included ROE's under AB-H-K-GIDP, because I get approximately a .04 RC run differential between AB-H-K-GIDP and SO when I include ROE under AB-H-K-GIDP.

If you include partial baserunners for SH and AB-H-K-GIDP to account for ROE's, then you don't get negative B coefficients for the walk when using initial baserunners. You still get negative B coefficients using final baserunners.

10. Did you use the totals that I included in the first part of the series, or did you use the 1954-2007 totals (those are the only ones I see in your spreadsheet)?

It seems like your values are pretty close to mine, so the potential difference in input would be sufficient to explain the differences.

You have a good point about attempting to estimate errors and including a fractional SH. My guiding principle behind these formulas was not to estimate any missing data. That choice makes the formula look nicer, but it admittedly does make it less than optimal from an accuracy standpoint.

11. I used the 1960-2004 data for Ruane's LW. I'll go back and enter a column for the totals. We could have used different sources for the totals. I used the BDB database.
