Walk Like a Sabermetrician: March 2020

Monday, March 23, 2020

Tripod: Theoretical Team Base Runs

See the first paragraph of this post for an explanation of this series.

While Base Runs is an incredibly flexible run estimator when it comes to working across a wide range of contexts, as a multiplicative formula it is not directly applicable to individual batters. However, there are a number of ways that you can use Base Runs to assist in your evaluation of batters. One way is to use Base Runs to calculate Linear Weights for your entity, and then apply these weights to the individual batters in the entity. You can find the weights for the 1978 AL and calculate Reggie Jackson's linear weights from this. Or you could find the weights for the Yankees and get a measure of Jackson's run creation in his own team context. Or you could find the weights for the Red Sox and see how many runs Jackson would have created in that context. The possibilities are close to limitless.

However, when you calculate Jackson's value in the Red Sox's context, you have not accounted for the fact that if Jackson played for the Red Sox, he would change that context. If you want to include this effect, things get a bit more complicated.

The basic ideas in this area were pioneered by David Tate, who published a method called Marginal Lineup Value which used Runs Created in a similar way. Keith Woolner also played an important role in the development of MLV. While the options I am detailing here are not directly adapted from Marginal Lineup Value, many of the ideas are, and their work set off the light in my head and in those of others who have laid out similar techniques, so their contributions must be recognized.

Bill James' "new" Runs Created, introduced in the STATS All-Time Major League Handbook and used in their other publications since then (as well as the Bill James Handbook from Baseball Info Solutions), also incorporates many of these ideas and introduced an ingenious way to state absolute results--that is, the number of total runs created rather than runs above some baseline as the Tate/Woolner method did.

The first step in applying this method is to assume that we have a team of 8 average players each getting an equal number of Plate Appearances. Then we add the player in question to this team, with the same number of PAs as the other eight players, which we will make equal to the player in question's actual PA. Then we calculate the new A, B, C, and D factors for this team.

Let's use Mark McGwire's 1998 season as an example of how this works. We will put him on a team that performs at the 1961-2002 composite data discussed in the BsR article. This league has a ROBA of .3007, AF of .3047, OA of .6763, and HRPA of .0230. McGwire personally compiled an A factor of 244, a B factor of 267.69, a C factor of 357, and a D factor of 70 (there will be rounding differences with the spreadsheet throughout this essay).

The non-McGwire portion of the team will have an A factor of 8*PA*LgROBA, where PA is McGwire's PA and LgROBA is the ROBA for the entity in question. We will call the 8*LgROBA portion E. From there:

E = 8*LgROBA
F = 8*LgAF
G = 8*LgOA
H = 8*LgHRPA

For the 1961-2002 data (which I will call from here on out the "standard" or "reference" league) these values are E = 2.41, F = 2.45, G = 5.41, and H = .184.

Then, the new A factor for the team with McGwire will be A + E*PA, where A is McGwire's personal A and PA is, again and for the last time, his personal PA. Then:

TmA = A + E*PA
TmB = B + F*PA
TmC = C + G*PA
TmD = D + H*PA

We then put these together to estimate the number of runs this team will score with McGwire as TmA*TmB/(TmB + TmC) + TmD, and subtract from this the number of runs the eight players would score without McGwire. Without McGwire, the team will score LgROBA*LgAF/(LgAF + LgOA) + LgHRPA times eight times PA. We can make a formula for I:

I = 8*(LgROBA*LgAF/(LgAF + LgOA) + LgHRPA)

For the standard league, I = .93. Then we can make a big equation for the difference between the team with McGwire and without McGwire:

TT BsR = (A + E*PA)*(B + F*PA)/((B + F*PA) + (C + G*PA)) + (D + H*PA) - I*PA

Which algebraically simplifies to:

TT BsR = (A + E*PA)*(B + F*PA)/(B + C + (F+G) * PA) + D - (I - H)*PA

Which, for the standard league is:

TT BsR = (A + 2.41*PA)*(B + 2.44*PA)/(B + C + 7.86*PA) + D - .75*PA

For McGwire, we get a value of 169.03. This can be compared to his personal BsR, calculated through the team formula, of 174.86, or the LBsR for McGwire when you use the linear weights derived by BsR for the standard league, which is 168.38. So you can see that since McGwire was a high-production player, his personal BsR is higher than what you get if you put him on a standard team. But since McGwire personally alters the run environment of the team he is added to, his TT BsR is higher, although only slightly, than his LBsR.

We can also find McGwire's TT BsR above baselines other than absolute. I will use average here, and below will (tentatively) sketch out a procedure to use replacement level (or any other baseline for that matter). To apply an average baseline, all we have to do is compare McGwire to a team of 9 average players rather than 8 average players. We can use the same formulas as above, except that for this team I will be figured as:

I = 9*(LgROBA*LgAF/(LgAF + LgOA) + LgHRPA)

I = 1.05 for the standard league, which gives this equation for TT BsR Above Average:

TT BsRAbvAvg = (A + 2.41*PA)*(B + 2.44*PA)/(B + C + 7.86*PA) + D - .87*PA

For McGwire, this gives a value of +90.76 runs above average.
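For anyone who wants to check the arithmetic, here is a minimal Python sketch of the absolute and above-average formulas. The function name and the `baseline` argument are illustrative, and PA = 671 (AB + W, using only the basic events) is an assumption rather than a figure quoted above, so expect small rounding differences from the values in the text.

```python
# League rates for the 1961-2002 "standard" league, as quoted in the text.
LG_ROBA, LG_AF, LG_OA, LG_HRPA = 0.3007, 0.3047, 0.6763, 0.0230


def tt_bsr(A, B, C, D, PA, baseline=8):
    """Theoretical Team BsR for one player added to 8 average hitters.

    `baseline` is the number of average players whose runs are subtracted:
    8 gives the absolute figure, 9 gives runs above average.
    """
    E, F, G, H = 8 * LG_ROBA, 8 * LG_AF, 8 * LG_OA, 8 * LG_HRPA
    lg_runs_per_pa = LG_ROBA * LG_AF / (LG_AF + LG_OA) + LG_HRPA
    I = baseline * lg_runs_per_pa
    tm_a, tm_b, tm_c, tm_d = A + E * PA, B + F * PA, C + G * PA, D + H * PA
    return tm_a * tm_b / (tm_b + tm_c) + tm_d - I * PA


# McGwire's 1998 factors from the text; PA = 671 is an assumed value (AB + W).
mcgwire = dict(A=244, B=267.69, C=357, D=70, PA=671)
print(round(tt_bsr(**mcgwire), 2))              # absolute TT BsR
print(round(tt_bsr(**mcgwire, baseline=9), 2))  # TT BsR above average
```

With these inputs the two results should land within a fraction of a run of the 169.03 and +90.76 quoted above; the remaining gap comes from the rounded league rates.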

These formulas are very long and confusing. One thing we can do is differentiate them and state them as a new set of custom LW for the team now that we have added the player. The formula for this is:

LW = ((B + C + (F + G)*PA)*((A + E*PA)*(b + F*p) + (B + F*PA)*(a + E*p)) - (A + E*PA)*(B + F*PA)*(b + c + F*p + G*p))/((B + C + (F + G)*PA)^2) + d - I*p + H*p

In this formula, a, b, c, and d are the derivatives of the player's A, B, C, and D factors with respect to the event in question (that is, the event's coefficient in each factor), and p is the derivative of the plate appearance function for each event, where PA = AB + W + HB + SH + SF. In the case of McGwire, we know that the LBsR weights for the standard league (displayed as S, D, T, HR, W, O) are: .476, .806, 1.136, 1.495, .320, -.095, which gives him 168.38 runs. Using the formula above, we get .490, .823, 1.157, 1.499, .331, -.103, which produces 169.03. These results are similar to calculating the new rate stats for the team with our player added. For example, TmROBA = 1/9*ROBA + 8/9*LgROBA, and so on in this fashion, and then using the classic LW from BsR formula to find the LW (TmROBA is A, TmAF is B, etc.).
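One way to gain confidence in a derivative this long is to compare it against a numerical derivative of the TT BsR formula itself. The sketch below does that in Python. The function names are illustrative, PA = 671 is an assumed value, and the event coefficients are hypothetical placeholders, not the B weights actually used on this site.

```python
# Standard-league rates quoted in the text, and the constants built from them.
ROBA, AF, OA, HRPA = 0.3007, 0.3047, 0.6763, 0.0230
E, F, G, H = 8 * ROBA, 8 * AF, 8 * OA, 8 * HRPA
I = 8 * (ROBA * AF / (AF + OA) + HRPA)


def tt_bsr(A, B, C, D, PA):
    """Absolute Theoretical Team BsR."""
    return (A + E * PA) * (B + F * PA) / (B + C + (F + G) * PA) + D + H * PA - I * PA


def tt_lw(A, B, C, D, PA, a, b, c, d, p):
    """Linear weight of one event, following the LW formula above."""
    tm_a, tm_b = A + E * PA, B + F * PA
    denom = B + C + (F + G) * PA
    num = (denom * (tm_a * (b + F * p) + tm_b * (a + E * p))
           - tm_a * tm_b * (b + c + F * p + G * p))
    return num / denom ** 2 + d - I * p + H * p


player = dict(A=244, B=267.69, C=357, D=70, PA=671)  # PA is an assumed value
# Hypothetical event: one more baserunner, 0.8 of advancement, one more PA.
event = dict(a=1.0, b=0.8, c=0.0, d=0.0, p=1.0)

formula = tt_lw(**player, **event)

# Finite-difference check: add a tiny fraction of the event and see how TT BsR moves.
eps = 1e-4
step = dict(A=event["a"], B=event["b"], C=event["c"], D=event["d"], PA=event["p"])
bumped = {k: v + eps * step[k] for k, v in player.items()}
numeric = (tt_bsr(**bumped) - tt_bsr(**player)) / eps
print(formula, numeric)
```

The two printed values should agree to several decimal places; if they do not, a term has been dropped somewhere in the transcription of the formula.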

You can also use the above formula with the Above Average TT formula--the only difference is that you have to use the different I for the nine-man lineup. For McGwire, this gives these LW: .373, .707, 1.040, 1.383, .215, -.220. The effect of this technique is to subtract the League R/PA (as figured by BsR) from each event that accounts for a PA (or in the lingo of the method above, has p = 1), and to make no change to any event that does not account for a PA (p = 0). This happens because the only difference in the two formulas is the difference in the I values. The above average I value is 9*LgR/PA, and the absolute I value is 8*LgR/PA. So the difference is LgR/PA, but this is only multiplied by PA. So an event like a steal that does not account for a PA does not lose any value at all between the two formulas. This probably illustrates that the TT Average technique is a shortcut but not a solution, because the difference is based on subtracting PA rather than comparing to outs or team outs, etc. The best way to find the TT BsR above some baseline would probably be to first find the Absolute TT BsR and then apply some baseline comparison as you would with any other runs created estimate.

When the Theoretical Team procedure is applied to Runs Created, it just so happens that TT RC = 1/9*Traditional RC + 8/9*Linear RC. In the past, I have incorrectly taken this fact as proof that the same was true for Base Runs. It is not true. I am not quite sure of the technical reasons for this, but I believe it is because the RC formula is pure multiplication: A*B*(1/C), if you will. But BsR involves two additions (B + C and adding D to the whole thing), and I think this eliminates the property. Anyway, it still comes pretty close to this. You can set up this equation:

TT BsR = x(BsR) + (1 - x)(LBsR)

If you solve for x:

x = (TT BsR - LBsR)/(BsR - LBsR)

If you do this for McGwire, you find that his TT BsR is made up 10.6% of his Straight BsR and 89.4% of his Linear BsR.

So far we have assumed that the player keeps the same number of PAs he had in actuality when we move him onto a new team. But we know that this, too, is a simplification. Just as the batter changes the run values of the team he is on by changing the context, his ability to avoid outs (or, equivalently if we ignore outs made on the basepaths, to get on base) will directly impact the number of Plate Appearances his team will have in which to score runs. To account for this, we will add a new factor called PAR to the Theoretical Team BsR formulas.

Before we do this, though, it should be pointed out that when we do this we are leaving the realm of attempting to estimate the number of runs the player has actually created and are trying to estimate the number of runs the player would theoretically create if added to an otherwise average team. For one thing, the player's actual PA already incorporate the effect of the extra PAs he adds by getting on base. So we can easily overstate his impact by allowing him to further inflate his PA on an average team after inflating his own PA on his own team. If the team he actually plays for has an above average rate of getting on base with him included, we will overstate the PA he will wind up with on his theoretical team. What we could do is find the actual percentage of his actual team's PAs that he used, convert this to an equivalent percentage on an average team, and plug that into the formula.

However we choose to do this, we will have some number for PA and go from there. The first step will be to calculate what I will call Not Out Average (NOA). NOA is simply the percentage of Plate Appearances that do not result in outs as recorded in the official statistics. NOA = (H + W + HB - CS - DP)/(AB + W + HB + SH + SF). We will further say that the denominator AB + W + HB + SH + SF = P (replacing PA in the formulas to come), and that the numerator H + W + HB - CS - DP = N. The derivatives of these will be called p and n: each event that is added in P or N has a p or n of 1 respectively, while CS and DP, which are subtracted in N, have an n of -1.

We will first calculate the NOA for the team with our player added as TmNOA = NOA*(1/9) + LgNOA*(8/9). We know that PA/G can be estimated as X/(1 - NOA), where X is the number of outs/game in the league that are accounted for in the official statistics. So we want the ratio between the PA/G for the team with our player and PA/G without our player, which we will call PAR for PA Ratio (this is a term I have borrowed from David Smyth). PAR = (X/(1 - TmNOA))/(X/(1 - LgNOA)). Simplifying this results in PAR = (1 - LgNOA)/(1 - TmNOA). Running through this with McGwire, the LgNOA = .3150, NOA = .4680, TmNOA = .3320, and PAR = 1.0254. So an average team with McGwire getting 1/9 of their PA will wind up with 2.54% more PA than a totally average team.
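The PAR calculation is short enough to sketch in a few lines of Python. The function name and the `share` argument are illustrative; the inputs are the NOA figures quoted above.

```python
def par(noa, lg_noa, share=1 / 9):
    """PA Ratio for an average team on which the player takes `share` of the PA."""
    tm_noa = share * noa + (1 - share) * lg_noa
    return (1 - lg_noa) / (1 - tm_noa)


# McGwire: NOA = .4680 against a league NOA of .3150.
mcgwire_par = par(0.4680, 0.3150)
print(round(mcgwire_par, 4))  # roughly 1.0254
```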

We then need to change each factor that we put in the BsR equation to account for PAR. For example, we started with TmA = A + E*PA. When PAR is incorporated, this is now TmA = A*PAR + E*PA*PAR, which can be rewritten as TmA = (A + E*PA)*PAR. The TmB, TmC, and TmD calculations are analogous. We then simply substitute these formulas into the original TT BsR formulas to get:

TT BsR w/ PAR = PAR*((A + E*P)*(B + F*P)/(B + C + (F + G)*P) + (D + H*P)) - I*P

Remember, we are now using P as the abbreviation for our player's Plate Appearances. As you can see, the I*P portion is not multiplied by PAR. This is because this part represents the number of runs the team would score without our player. PAR measures the effect of our player on the team PA/G, so it is irrelevant to how many runs the team would score if he did not play for them.

Just as with the original formula, we can easily compare to average by changing the I value as done previously. With PAR, we find McGwire's absolute TT BsR as 189.26 and +110.99 above average.
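Here is a minimal Python sketch of the PAR version, for checking the figures above. The names are illustrative, and P = 671 (AB + W) is an assumed value chosen because it is consistent with the .4680 NOA quoted earlier, so small rounding differences from the text are to be expected.

```python
# Standard-league rates quoted in the text.
LG = dict(roba=0.3007, af=0.3047, oa=0.6763, hrpa=0.0230, noa=0.3150)


def tt_bsr_par(A, B, C, D, P, noa, baseline=8):
    """TT BsR with the PA Ratio applied to the team-with-player portion only."""
    E, F, G, H = (8 * LG[k] for k in ("roba", "af", "oa", "hrpa"))
    I = baseline * (LG["roba"] * LG["af"] / (LG["af"] + LG["oa"]) + LG["hrpa"])
    tm_noa = noa / 9 + LG["noa"] * 8 / 9
    par = (1 - LG["noa"]) / (1 - tm_noa)
    team_runs = (A + E * P) * (B + F * P) / (B + C + (F + G) * P) + D + H * P
    # I*P is the team without the player, so PAR does not touch it.
    return par * team_runs - I * P


mcgwire = dict(A=244, B=267.69, C=357, D=70, P=671, noa=0.4680)  # P is assumed
print(round(tt_bsr_par(**mcgwire), 2))              # absolute, with PAR
print(round(tt_bsr_par(**mcgwire, baseline=9), 2))  # above average, with PAR
```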

Just as we have done previously, we can differentiate this equation to see the intrinsic linear weights that it uses. It is a long formula with an even longer derivative, so I will break the derivative up into two pieces.

The first step is to find the derivative of PAR with respect to each event. This is done by first differentiating NOA with respect to each event to get dNOA/dX, where X is S, D, T, HR, etc. Then we differentiate PAR with respect to NOA to get dPAR/dNOA. From here, (dPAR/dNOA)*(dNOA/dX) = dPAR/dX. This results in this formula:

dPAR/dX = (1/9)*(1 - LgNOA)/((1 - TmNOA)^2)*(P*n - N*p) /(P^2)
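As a sanity check on the chain rule step, the dPAR/dX formula can be compared against a numerical derivative. This is only a sketch: the function names are illustrative, and N = 314 and P = 671 are assumed totals (H + W and AB + W) that are consistent with the .4680 NOA quoted for McGwire.

```python
LG_NOA = 0.3150  # league Not Out Average quoted in the text


def par(N, P):
    """PA Ratio for a player with N non-out PA in P total PA."""
    tm_noa = (N / P) / 9 + LG_NOA * 8 / 9
    return (1 - LG_NOA) / (1 - tm_noa)


def dpar(N, P, n, p):
    """dPAR/dX for an event that adds n to N and p to P."""
    tm_noa = (N / P) / 9 + LG_NOA * 8 / 9
    return (1 / 9) * (1 - LG_NOA) / (1 - tm_noa) ** 2 * (P * n - N * p) / P ** 2


N, P = 314, 671  # assumed totals
eps = 1e-4
for name, n, p in [("hit or walk", 1, 1), ("batting out", 0, 1)]:
    numeric = (par(N + eps * n, P + eps * p) - par(N, P)) / eps
    print(name, dpar(N, P, n, p), numeric)
```

An on-base event should show a small positive value and an out a small negative one, with the formula and the numerical estimate agreeing closely in both cases.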

We can then differentiate the entire PAR TT BsR equation to get the formula for the linear weights there. In the equation below, dPAR/dX represents the derivative of PAR, figured by the above formula, with respect to whatever event we are differentiating the PAR TT BsR formula for:

LW = PAR*(((B + C + (F + G)*P)*((A + E*P)*(b + F*p) + (B + F*P)*(a + E*p)) - (A + E*P)*(B + F*P)*(b + c + F*p + G*p))/((B + C + (F + G)*P)^2) + d + H*p) + ((A + E*P)*(B + F*P)/(B + C + (F + G)*P) + D + H*P)*(dPAR/dX) - I*p

Yes, that is the longest sabermetric equation I have ever published on this website, or anywhere else for that matter. When we do this for Big Mac, we find .633,.975,1.317,1.669,.471,-.176. Again, by changing I to the average value we can get the LW for TT BsR Above Average W/ PAR, and again the difference is to subtract LgR/PA from each event where p = 1.

Applying Replacement Level

This is a real pain to calculate, and I don't use it, but I think it is a useful discussion to have for a number of reasons. If I wanted to apply a replacement level to TT BsR, I would calculate Absolute TT BsR and then apply the baseline from there. But we will look at the alternative.

To calculate Absolute TT BsR Above Replacement, all we would have to do is find an I value that would represent runs/PA for a team with 8 average players and 1 replacement player. The 8 average players part is easy, but in order to figure the replacement player in, we need to know how he will hit in terms of ROBA, AF, OA, and HRPA. Usually, though, we set replacement level as some percentage or linear difference of run production (be it in terms of per out or per PA, or Wins Above Average per PA, or R+/O+, or R+PA, etc.). But those assumptions don't tell us how the player will hit in terms of basic offensive events, just total production.

I will use 73% of the league runs/out as the baseline in this article (see the "Baselines" article for discussion of this), although you can apply a different baseline and still use the outlines of my procedure to do it. The first step will be to understand the Linear Weights Ratio (LWR). There are probably alternative ways to do this, but I have done it this way and it suits my purposes.

LWR is a great tool invented by Tango Tiger that takes the LW coefficients and converts them into a ratio of positive run production to outs. I have linked his little article on it at the bottom of the page, but will cover the basics again here. Before I start, I should discuss the treatment of various events in my concept of replacement level here. I am assuming that a replacement level player is a replacement level player because of his hitting performance (S, D, T, HR, W, outs). He will steal bases, bunt, hit sac flies, hit into DPs, etc., at a league average rate. There are certainly debatable assumptions in there, but you have to keep things reasonably simple.

To establish LWR, we put the positive value of S, D, T, HR, and W in the numerator. We then set the single weight to one and rescale all of the other coefficients based on their ratio to singles. So let d = LW(double)/LW(single), and t = LW(triple)/LW(single), etc. Then we have this formula (all of the terms in the formulas that follow, unless otherwise marked, apply to league statistics):

LWR = (S + d*D + t*T + hr*HR + w*W)/(AB-H)

For the standard league:

LWR = (S + 1.693*D + 2.386*T + 3.139*HR + .671*W)/(AB - H)

Once we have this, we can use this fact about LWR:

Runs/Out = LW(single)*LWR + LW(out)

For our league, the Runs/Out from LWR is .172 (the LWR itself is .562), and the LW out value is -.095. 73% of .172 is .126. What LWR will produce a R/O of .126? First, let x be the replacement rate (73%). Then RepLWR is given by the equation:

RepLWR = (x*(LgR/O) - out value)/LW(single)

This results in .464, which converts back to .126 runs/out.
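The round trip from league LWR to replacement LWR and back to runs per out is easy to verify. Here is a small Python sketch using only the standard-league figures quoted above; the function names are illustrative.

```python
LW_SINGLE, LW_OUT = 0.476, -0.095  # standard-league LBsR weights for S and O
LG_LWR = 0.562                     # standard-league Linear Weights Ratio


def runs_per_out(lwr):
    """Runs/Out = LW(single)*LWR + LW(out)."""
    return LW_SINGLE * lwr + LW_OUT


def rep_lwr(x):
    """LWR of a player who produces x times the league runs per out."""
    return (x * runs_per_out(LG_LWR) - LW_OUT) / LW_SINGLE


r = rep_lwr(0.73)
print(round(r, 3), round(runs_per_out(r), 3))  # about .464 and .126
```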

So we know that a replacement player will put up a LWR of .464. Now we need to convert this relationship back into the effect on his component stats. What we do first is find a value that I will call Y. Y is the ratio of the quantity of "positive" in the LWR that the league has generated from a given event divided by the quantity of "positive" it has generated from singles. To illustrate, the standard league has a single per PA of .166. On a per PA basis, the positive LWR contribution of singles is 1*.166 = .166. The league has double/PA of .041. The positive LWR contribution of doubles per PA is 1.693*.041 = .069. .069/.166 = .418 is the Y value for doubles. Sum up the Y values for all events (including singles). Or if you prefer a formula:

Y = 1 + (d*D/P)/(S/P) + (t*T/P)/(S/P) + (hr*HR/P)/(S/P) + (w*W/P)/(S/P)

Y is 2.290 for the standard league.

We also need another quantity, Z. Z is simply the ratio of the rate of a given event divided by the rate of singles. So Z for doubles is .041/.166 = .244, and the formula for the summed Z values is:

Z = 1 + (D/P)/(S/P) + (T/P)/(S/P) + (HR/P)/(S/P) + (W/P)/(S/P)

Z is 1.950 for the standard league.

What exactly have these Y and Z steps done? They have converted all of the contribution of doubles, triples, home runs, and walks into an equivalent number of singles. What we are saying is that for the standard league, the quantity of positive LWR is equivalent to 2.290 times the number of singles (this is Y), and the number of runners on base is equivalent to 1.950 times the number of singles (this is Z). This procedure is in a similar spirit to the "Willie Davis method" introduced by Bill James in the New Historical Baseball Abstract, in which he expresses everything in terms of an equivalent number of hits. Why does he do this? Because it allows you to have one variable to solve for in an equation instead of five. Once we find the value of S that we are looking for, we can convert this back into D, T, HR, and W values.

What we are after is the rate at which a replacement player would hit singles to produce a .464 LWR. We have this equation:

RepLWR = Y*X/(1 - Z*X)

Where X is the S/PA for the replacement player. The equation to solve for X is:

X = RepLWR/(Y + RepLWR*Z)

So for the standard league, X = .145. The replacement player will get a single in 14.5% of his PAs compared to 16.6% for an average player. Since we have assumed that S, D, T, HR, and W will all be reduced by the same percentage, we divide .145 by .166 to get the "Multiplier". So Multiplier = X/(S/P) and is .875 for the standard league. So the replacement level player in the standard league will hit singles, doubles, triples, homers, and draw walks, at 87.5% of the rate that an average player would. Just to be absolutely clear, Rep(D/P) = (D/P)*Multiplier, and so on.
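The solve for X and the Multiplier can be sketched in Python as follows, using the standard-league Y, Z, RepLWR, and S/P values quoted above. The function name is illustrative.

```python
def rep_single_rate(rep_lwr, Y, Z):
    """Solve RepLWR = Y*X/(1 - Z*X) for X, the replacement player's S/PA."""
    return rep_lwr / (Y + rep_lwr * Z)


Y, Z = 2.290, 1.950          # summed Y and Z values for the standard league
REP_LWR = 0.464              # replacement-level LWR
LG_SINGLES_PER_PA = 0.166    # league S/P

x = rep_single_rate(REP_LWR, Y, Z)
multiplier = x / LG_SINGLES_PER_PA
print(round(x, 3), round(multiplier, 3))  # about .145 and .875

# Plugging X back into the original equation should return RepLWR.
print(Y * x / (1 - Z * x))
```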

For the out value, there are two mathematically equivalent techniques. One is to find Rep(O/P) as 1 - Rep(S/P) - Rep(D/P) - Rep(T/P) - Rep(HR/P) - Rep(W/P). The second is to figure Rep(O/P) as 1 - (1 - O/P)*Multiplier. The second equation is essentially equivalent to saying that the OBA for the replacement player will be 87.5% of the OBA for the average player as well.

Once we have calculated the S, D, T, HR, W, and O per PA for a replacement player, we can calculate the ROBA, AF, OA, and HRPA for him (ignoring all terms other than those we have for the replacement player: S, D, T, HR, W, and O). We then calculate rtROBA as (1/9)*RepROBA + (8/9)*LgROBA (rtROBA is "replacement team" ROBA; that is, a team that is 8/9 average and 1/9 replacement). We calculate the other terms similarly and then figure the I value for the replacement comparison as:

I = (rtROBA*rtAF/(rtAF + rtOA) + rtHRPA)*9

For the standard league, I = 1.02 and McGwire is +108.38 runs above replacement. We can also apply PAR using the same formulas as above.

Let me now just briefly discuss the method I used to find stats for a replacement player. One major weakness that I already mentioned was limiting the difference between the replacement player and an average player to only the basic hitting events. Another is that I assume that among the basic hitting events, all deflate equally. The replacement player in the standard league has a rate of 12.5% fewer singles, 12.5% fewer doubles, etc. I have not studied the issue, but I would assume that replacement type players lose more in secondary offensive skills (power and walks) than they do in singles. Of course, you also get into an issue of whether the replacement player should be based on the various definitions of replacement level that have been offered, or whether it should be theoretical. If you are looking for a theoretical approach, assuming equal deflation of all basic offensive events can be justified.

Another concern is how to define replacement level, or baseline to be more general. I have used a default of 73% of league runs/out which corresponds to a .350 Offensive Winning Percentage which was used by Bill James and continues to be used by many analysts. Then I have used Linear Weight Ratio to estimate how their component stats would turn out. However, it might actually be more appropriate to set replacement level as a percentage of league LWR or some other approach. The method I have laid out here could be modified for other choices of definition, but it is not ready to handle another definition as is.

I will also point out that the replacement definition method has some broader applications than just replacement level. Suppose you have positional adjustments defined as a percentage of league R/O as I do elsewhere on this site. If first basemen perform at 115% of the league average R/O, what should their BA/OBA/SLG be? You can use the replacement level method here to get an estimate for that. Or suppose you know that a park inflates runs by 10%. How much should it inflate OBA by? (If it affects all events equally, which it probably doesn't. But it could tell you what a theoretical park would do. Or maybe you know it won't affect walks, so you could hold those constant. You get the idea.) I'm sure you could think up other uses as well. But that's another article.

Tango Tiger's LWR Page
Base Runs Spreadsheet

Tuesday, March 10, 2020

Tripod: Base Runs II

See the first paragraph of this post for an explanation of this series.

This is the second page on Base Runs that I have written up for this site. It makes no attempt to cover new concepts that weren't addressed in the original page. What it does try to do is write up the information from the first page in a more accessible way. I have seen comments that the Base Runs page on this site is hard to understand. Unfortunately, the main cause of this problem is probably my writing style and skill (or, more appropriately, lack thereof). However, it is true that the original page was created by adding on new concepts as time passed, and therefore is somewhat of a hodge-podge of different ideas, written at different times, without a comprehensive master plan in mind. This page will attempt to address this.

Philosophy and Origins of Base Runs

Base Runs is a run estimator developed in the early 1990s by David Smyth. Like Runs Created, BsR is designed to estimate the number of runs that a team would score. Methods of this type attempt to incorporate the interactive effect of offensive events. A linear weights formula like Extrapolated Runs or Estimated Runs Produced (or even, essentially, Clay Davenport's Equivalent Runs) applies a static run value to each event. Usually these formulas weight a walk at around 1/3 of a run. And in most circumstances, this is a good estimate of the number of runs that will result from a walk. But in a game in which a team draws a walk and makes 27 outs, the walk will not have the same value. In fact, since estimators like ERP apply a value of about -1/10 of a run for every out, they will predict somewhere in the neighborhood of -2.4 runs for that game. This answer is obviously wrong.

The reason for this is that linear formulas are designed to work with a certain range of data that corresponds to the range in which normal major league teams perform. When you apply the method in these contexts, it will give very accurate estimates. But when you attempt to take the method outside of the context for which it was developed, problems will result. None of this is meant to put down linear formulas, which are very useful in sabermetrics. It only serves to illustrate the much more difficult task that BsR or RC attempt to perform. Ideally, they should be models of run scoring that work over a wide range of contexts and can give an accurate estimate for unusual or extreme situations. Another way to look at this is that Base Runs generates custom linear weights that are intrinsically generated and then applied in all situations.

Unfortunately, Runs Created does a very poor job of estimating in extreme contexts--in fact, in many cases poorer than linear methods! The reason for this is that while RC is constructed based on reasonable principles of how an offense works, it does not recognize certain constraints on the number of runs that will be scored.

For example, a home run will always produce at least one run. It does not matter if every other batter has made an out, the team will get a run if they go deep. The Basic RC formula of (H+W)*TB/(AB+W) would predict (1+0)*4/(28) = .14 runs for a team that hit a homer and made 27 outs. But we know that they must score at least one run.

Furthermore, if all you do is hit home runs, each home run will produce just one run. Suppose a team entered the bottom of the ninth trailing by two runs, and the first two batters hit home runs. RC would predict (2+0)*8/2 = 8 runs, or 4 from each home run. This is another impossibility.

Another "known point" is the case of all outs. You will score zero runs if all you do is make outs, but you cannot wind up with negative runs. RC correctly predicts zero runs, but all linear methods must predict negative runs below a certain level of production in order to have any accuracy in normal levels.

Base Runs gives much more reasonable estimates in these extreme circumstances. This is because it starts with a true model of how runs are scored. Each batter that comes to the plate will eventually do one of three things: make a batting out, hit a home run, or reach base. Once he has reached base, there are three more potential outcomes: he will score, make an out on the bases, or be left on base at the end of the inning. Simplifying further, an identity for the number of runs scored can be written as Baserunners * % of baserunners who score + Home Runs. This is an undeniably true statement. BsR uses this model to derive an estimate of runs scored.

Although the identity is undeniably true, the estimates that the formula uses are not. If we are given a team's offensive statistics but not their runs scored, we can never know for sure what percentage of baserunners will score--if we knew this, we would have a method with 100% accuracy. We do know for sure the number of home runs, and we do have a very good estimate of the number of baserunners (but we don't know, for instance, how many runners will be retired stretching doubles into triples). It is the percentage of baserunners who score that involves an estimate that is not assured of being almost 100% correct, and therefore this component is a crucial determinant of the accuracy of the estimate.

Smyth broke his formula into four factors denoted as A, B, C, and D. A is simply the number of baserunners. D is simply the number of home runs. B is the "advancement factor", representing the advance of baserunners towards scoring. C is the number of outs. B/(B + C) serves as the estimate of the percentage of baserunners that will score. Putting it all together, the construct for BsR is:

BsR = A*B/(B + C) + D

Or if it's easier for you to see this way:

BsR = A*(B/(B + C)) + D

An important note here is that the use of B/(B+C) is not an inevitable one. Any formula that accurately estimates the percentage of baserunners that will score could be used. However, the basic B/(B+C) model developed by Smyth is the most accurate currently known. It may well be possible to improve the accuracy, but it would probably involve a much more confusing or expansive formula. The important point is that B/(B+C) is used because it has been empirically shown to work.

Other run estimators have incorporated the idea of Runs = baserunners*% who score + HR, such as Eric Van's Contextual Runs. Van modeled the scoring percentage as B/C, where B was advancement (although with radically different weights) and C was outs. Using this ratio results in poorer results when the number of outs is low, though. But BsR potentially could be improved if a more accurate model of the percentage of baserunners who score was found.

Base Runs formulas

Many different formulas for Base Runs have been created and used. This has led to some confusion about what the "true" or "official" formula was. One of the great beauties of BsR is that it is very flexible and the basic construct can lead to many different versions. But in the interest of alleviating some of the confusion, Smyth published three versions, each designed to work with different datasets. The most basic of these is:

A = H + W - HR
B = (1.4*TB - .6*H - 3*HR + .1*W)*1.02
C = AB - H
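To see how this basic version behaves at the "known points" discussed earlier, here is a minimal Python sketch that compares it with the Basic RC formula of (H+W)*TB/(AB+W). The function names are illustrative, D is taken to be home runs as in the description of the four factors, and the guard for the case where B + C is zero is an addition made only to avoid dividing by zero.

```python
def bsr_basic(ab, h, tb, hr, w):
    """Smyth's most basic Base Runs version, with D = HR."""
    a = h + w - hr
    b = (1.4 * tb - 0.6 * h - 3 * hr + 0.1 * w) * 1.02
    c = ab - h
    if b + c == 0:  # no advancement and no outs: nothing for B/(B + C) to estimate
        return float(hr)
    return a * b / (b + c) + hr


def rc_basic(ab, h, tb, w):
    """Basic Runs Created: (H + W)*TB/(AB + W)."""
    return (h + w) * tb / (ab + w)


cases = {
    "one HR, 27 outs": dict(ab=28, h=1, tb=4, hr=1, w=0),
    "two HR, no outs": dict(ab=2, h=2, tb=8, hr=2, w=0),
    "27 outs, nothing else": dict(ab=27, h=0, tb=0, hr=0, w=0),
}
for name, s in cases.items():
    rc = rc_basic(s["ab"], s["h"], s["tb"], s["w"])
    print(f"{name}: BsR = {bsr_basic(**s):.2f}, RC = {rc:.2f}")
```

BsR returns exactly one run per home run in the first two cases and zero in the third, while RC gives .14 and 8 for the home run cases, as described above.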

Another version included all of the offensive events (contained in the official statistics) with the exception of sacrifices:

A = H + W + HB - HR - .5*IW
B = (1.4*TB - .6*H - 3*HR + .1*(W + HB - IW) + .9*(SB - CS - DP))*1.1
C = AB - H + CS + DP

Finally, a version applicable with official pitching statistics:

A = H + W - HR
B = (1.4*TBe - .6*H - 3*HR + .1*W)*1.1
C = 3*IP
Where TBe = 1.12*H + 4*HR
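As an illustration of the pitching version, here is a short Python sketch. The function name is illustrative, D is again taken to be home runs, IP is assumed to be expressed in true decimal innings (so 220 1/3 innings would be entered as 220.333, not 220.1), and the stat line is entirely hypothetical.

```python
def bsr_pitching(ip, h, w, hr):
    """Base Runs allowed from official pitching statistics, with D = HR."""
    tbe = 1.12 * h + 4 * hr  # estimated total bases allowed
    a = h + w - hr
    b = (1.4 * tbe - 0.6 * h - 3 * hr + 0.1 * w) * 1.1
    c = 3 * ip               # outs recorded
    return a * b / (b + c) + hr


# Hypothetical pitcher: 220 IP, 200 H, 50 W, 20 HR.
runs = bsr_pitching(ip=220, h=200, w=50, hr=20)
print(round(runs, 1), round(9 * runs / 220, 2))  # estimated runs and runs per 9 innings
```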

Another important version of the formula was published by Tango Tiger. He developed it from the use of 1974-1990 play-by-play data from Retrosheet, so it includes many categories that aren't included in the official statistics. It is best that you read Tango's explanation of this formula if you are interested, so please visit his article.

Applying BsR

There are many different ways to apply BsR, and this section does not purport to examine all of the possibilities. There are some basic principles to lay out, though. The key is that BsR should NOT be applied to individual hitters. Base Runs models the run scoring of a team. Individual players do not act as entire teams--they act as one part out of nine in a team. Barry Bonds' walks do not interact directly with his home runs--they interact with the home runs of his entire team. So it is wrong to apply BsR to individual hitters, just as it is to apply RC to individual hitters. It is true that applying BsR to individual hitters will often result in a decent estimate and will do better than RC (because the flaws of RC combine with the incorrect application to produce even worse results), but it is not recommended. In general, it will serve to overrate good offensive players' run production and underrate that of bad players.

But by the same logic, you should apply Base Runs to individual pitchers, because when a pitcher is in the game, his performance interacts with no other pitchers. He is the lone pitcher for his team, and he dramatically affects the run environment he pitches in--in fact, he alone determines it (to the extent that a pitcher can). Linear Weight formulas, as discussed earlier, are designed to work in normal major league team contexts. Replacing an average hitter with even an extreme hitter like Bonds does change the run environment, but generally not enough to severely impact the accuracy of the LW estimate. This is not at all true for pitchers. A team that hit all the time as the average batter does against Johan Santana would be laughed out of the league and would fall outside of the range of best accuracy for LW formulas. Base Runs does a good job of adapting to these extreme circumstances and should be applied to pitchers and teams, but not individual batters.

Versions of BsR Used on This Site

On this site, I use three BsR formulas: one that incorporates only the basic offensive events, one that incorporates SB and CS, and a third that incorporates all of the official offensive categories. I do not claim these formulas to be more accurate or "better" than others--in fact, they are probably less accurate than other formulas. However, they still are very accurate at estimating runs scored and can be used without too much concern. I have used them for the examples of other concepts involving BsR on this page and in the accuracy test published here.

While the A, B, C, and D factors all have straightforward definitions, this does not make the choice of which events to put in them inevitable. David Smyth, Tango Tiger, Robert Dudek, myself, and possibly others have developed BsR versions and have used different philosophies to guide what to include in each factor. For instance, Smyth once published a version with D = HR + SF, since like HR, SF are guaranteed runs. Another common quandary is whether CS should be a loss of a baserunner, an additional out, or both. As we will see later, there are also advantages to giving each event a B value, even if it has been included in the other factors.

In the versions presented here, I have used the following thinking to guide my choices. I don't claim these choices as the correct or best choices, but to me, they are the most logical and easiest to work with.

The A factor represents "final" baserunners. What I mean by this is that it is the number of baserunners that, as far as we can tell from the official statistics, were not retired once they reached base. So, in versions that utilize those stats, caught stealings and double plays are removed because those runners are known to have been out.

The B factor, which represents advancement as always, includes all events with the exception of outs in the first two versions. However, the "full" version that incorporates all of the official offensive categories puts every event in the B factor, as this greatly helps to balance the formula and makes it much easier to construct.

The C factor includes batting outs; outs made by BATTERS. So CS and DP are not batting outs; the baserunner was caught stealing and the fact that the batter was retired on the double play was already accounted for in his AB-H total. But SH and SF are batting outs.

The D factor is home runs, always. While it is true that we know for each SF a run will score, I consider this an accident of the official statistics and not a fundamental fact of baseball. For instance, we could also easily have an official statistic for "RBI Groundouts". But we do not. And suppose the statistics broke down each hit type into "RBI Singles" or "Non-RBI Triples", etc. If we put each of these events into D, we would eventually wind up with a formula that just said that Runs = Runs. For this reason, I do not consider SF as a "guaranteed run" under the BsR definition. Maybe you could define D as "guaranteed runs created without the use of a baserunner", since the batter who hits a home run does not become a baserunner and the SF requires a runner on third base to result in a run.

Based on these underpinnings, here are the formulas used on this site for Base Runs (I should point out that the basic and SB versions were actually originally published by David Smyth a few years ago, but I have continued to employ them):

BASIC

A = H + W - HR
B = (2*TB - H - 4*HR + .05*W)*.78 = .78*S + 2.34*D + 3.9*T + 2.34*HR + .039*W
C = AB - H

STOLEN BASE

A = H + W - HR - CS
B = (2*TB - H - 4*HR + .05*W + 1.5*SB)*.76 = .76*S + 2.28*D + 3.8*T + 2.28*HR + .038*W + 1.14*SB
C = AB - H

FULL

A = H + W + HB - HR - CS - DP
B = .777*S + 2.61*D + 4.29*T + 2.43*HR + .03*(W + HB - IW) - .747*IW + 1.30*SB + .13*CS + 1.08*SH + 1.81*SF + .70*DP - .04*(AB - H)
C = AB - H + SH + SF

An alternate B factor incorporated a different value for strikeouts than other outs:

B = .781*S + 2.61*D + 4.28*T + 2.42*HR + .034*(W + HB - IW) - .741*IW + 1.29*SB + .125*CS + 1.07*SH + 1.81*SF + .69*DP - .029*(AB - H) - .086*K
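
The Basic and Stolen Base B factors above are each written two ways: compactly in terms of TB and H, and expanded event by event. A small Python sketch can confirm that the two forms agree. The function names are hypothetical, the event counts are the 1961 Yankees totals used in the Linear BsR example, and the stolen base count is a made-up number for illustration only.

```python
def b_basic_compact(tb, h, hr, w):
    return (2 * tb - h - 4 * hr + 0.05 * w) * 0.78


def b_basic_expanded(s, d, t, hr, w):
    return 0.78 * s + 2.34 * d + 3.9 * t + 2.34 * hr + 0.039 * w


def b_sb_compact(tb, h, hr, w, sb):
    return (2 * tb - h - 4 * hr + 0.05 * w + 1.5 * sb) * 0.76


def b_sb_expanded(s, d, t, hr, w, sb):
    return 0.76 * s + 2.28 * d + 3.8 * t + 2.28 * hr + 0.038 * w + 1.14 * sb


s, d, t, hr, w = 987, 194, 40, 240, 543  # 1961 Yankees
sb = 100                                 # hypothetical stolen base total
h = s + d + t + hr
tb = s + 2 * d + 3 * t + 4 * hr

basic_gap = abs(b_basic_compact(tb, h, hr, w) - b_basic_expanded(s, d, t, hr, w))
sb_gap = abs(b_sb_compact(tb, h, hr, w, sb) - b_sb_expanded(s, d, t, hr, w, sb))
print(basic_gap, sb_gap)  # both should be zero up to floating-point rounding
```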

Determining the B Factor

Since the B factor is where the most estimation is involved (in fact, if you follow a strict definition of the factors as I did above, it is the only place where you have any choices to make in developing a formula), it is often possible to improve accuracy by tweaking it. Also, if one wishes to run a regression to find B coefficients, he would need to know the actual B value necessary to equal runs scored for the entity (team or league generally, but an individual player or any combination of baseball data could be considered an entity as well) in question. Here are two equivalent methods to determine what I will call ActB, the actual B factor.

The first is just to do algebra to rearrange the formula R = A*B/(B + C) + D to solve for B. You wind up with B = (R - D)*C/(A - R + D). A second way is to determine the actual percentage of baserunners that score, which I'll denote as Z. Z = (R - D)/A, which leads to B = Z*C/(1 - Z).

To adjust the B factor of a given formula, just find the value of your formula B for the entity in question and call it EstB. Then ActB/EstB is multiplied by the B coefficients you have, and then you wind up with a new B formula for your entity.
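
Here is a minimal Python sketch of both ActB methods and of the multiplier adjustment. The function names are hypothetical. The A, C, and D totals are those of the 1961 Yankees from the Linear BsR example, and the actual runs figure of 827 is an assumption supplied for illustration rather than a number taken from this article.

```python
def act_b(r, a, c, d):
    """Solve R = A*B/(B + C) + D for B."""
    return (r - d) * c / (a - r + d)


def act_b_from_score_rate(r, a, c, d):
    """Same result by way of Z, the share of baserunners who score."""
    z = (r - d) / a
    return z * c / (1 - z)


a, c, d = 1764, 4098, 240
est_b = (2 * 2455 - 1461 - 4 * 240 + 0.05 * 543) * 0.78  # formula B (EstB)
runs = 827                                               # assumed actual runs

exact = act_b(runs, a, c, d)
same = act_b_from_score_rate(runs, a, c, d)
scale = exact / est_b                     # ActB / EstB
adjusted_multiplier = 0.78 * scale        # rescales every B coefficient
rebuilt_runs = a * exact / (exact + c) + d

print(round(exact, 2), round(scale, 4), round(rebuilt_runs, 1))
```

Multiplying the whole B formula by ActB/EstB is the same as replacing the .78 multiplier with the adjusted one, since every B coefficient carries that multiplier.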

Accuracy of BsR

Various questions have been raised about the accuracy of BsR. Some people have claimed that since Base Runs purports to be accurate in extreme contexts, it must necessarily give up accuracy with normal teams. Other people are caught in the "accuracy trap"--they claim that the best run estimator is the one with the lowest Root Mean Square Error (RMSE) when applied to normal team data.

I will address the latter viewpoint first. Almost by definition, the highest accuracy in terms of RMSE will come using a linear multiple regression equation for runs. However, regression is a purely statistical tool and does not consider the fundamental facts of baseball as BsR does, or even to the lesser extent that the human developers of other run estimators have. Related to this, regression equations are tailored specifically to idiosyncrasies within their dataset and will not hold up when applied to a different dataset (although a larger sample size does help). While regression equations can be useful, very few people who are in the camp of "lowest RMSE" advocate using regression equations. This causes me to question their true adherence to this belief. Methods like Extrapolated Runs attempt to blend results from regression equations, skeletons, and empirical linear weights (see the "Linear Weights" article on this site), often sacrificing theoretical accuracy for results. A hybrid method like XR does test with greater accuracy than BsR in general, but at what cost?

Other run estimators simply cannot be trusted in their estimations in extreme contexts. Base Runs has its flaws too, but is generally a much better estimator across the entire spectrum of production. Methods like XR may be more accurate, slightly, on normal teams, but are far, far less accurate on extreme teams. This may be a trade that you are willing to make, depending on your needs, but it is not an inevitable one to every sabermetrician.

As to the claim that Base Runs does not have accuracy comparable to that of other run estimators when applied to regular teams, this is simply not true. The Stolen Base version of BsR presented above has a lower RMSE when applied to 1961-2004 data (excluding the strike-shortened seasons of 1981 and 1994) than does Stolen Base RC, ERP, Equivalent Runs, or Ugly Weights. The only methods which beat it in the test, which can be seen on the "Accuracy" page on this site, were a regression equation based on those teams, and XR. Base Runs' accuracy with actual teams is comparable to any of the other run estimators that have been published, and in many cases, better.

Writing Base Runs as a Rate

It is often helpful to be able to write Base Runs or other run estimators in terms of rates rather than raw numbers. The easiest way to do this is to calculate BsR/PA. To do this, simply divide the A, B, C, and D factors by plate appearances (figure PA using whatever data is used in the specific BsR equation you are using). I call A/PA "Runners On Base Average" (ROBA), B/PA "Advancement Factor" (AF), C/PA "Out Average" (OA), and D/PA simply Home Runs per Plate Appearance (HRPA). Then BsR/PA is very simple:

BsR/PA = ROBA*AF/(AF + OA) + HRPA

One advantage of the Basic BsR employed on this page is that it is written without knowing singles, doubles, and triples specifically, just hits, total bases, and home runs. This may not result in the most precise equation, but it does allow the rate stats above to be written in terms of BA, OBA, SLG, and HRPA (where OBA is just (H+W)/(AB+W)):

ROBA = OBA - HRPA
AF = ((2*SLG - BA)*(1 - OBA)/(1 - BA) - 4*HRPA + .05*(OBA - BA)/(1 - BA))*.78
OA = 1 - OBA
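
As a sanity check, here is a small Python sketch that builds BsR/PA from BA, OBA, SLG, and HRPA alone and scales it back up by PA. The function name is hypothetical, and the inputs are derived from the 1961 Yankees totals used in the Linear BsR example. The result should match the basic BsR estimate computed there from raw totals.

```python
def bsr_per_pa(ba, oba, slg, hrpa):
    """Basic BsR per plate appearance from four rate stats."""
    ab_per_pa = (1 - oba) / (1 - ba)  # AB/PA when PA = AB + W
    roba = oba - hrpa
    af = ((2 * slg - ba) * ab_per_pa
          - 4 * hrpa
          + 0.05 * (oba - ba) / (1 - ba)) * 0.78
    oa = 1 - oba
    return roba * af / (af + oa) + hrpa


# 1961 Yankees raw totals, converted to rates.
ab, h, tb, hr, w = 5559, 1461, 2455, 240, 543
pa = ab + w
rate = bsr_per_pa(ba=h / ab, oba=(h + w) / pa, slg=tb / ab, hrpa=hr / pa)
team_bsr = rate * pa
print(round(rate, 4), round(team_bsr, 2))
```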

Linear BsR

Base Runs, as already discussed, is a multiplicative formula. However, there are many advantages to Linear Weight formulas, including their ability to be used as a measure of individual hitter performance. Since Base Runs is an accurate estimator of run scoring across a wide range of contexts, we can use it to estimate linear weight values across a similarly wide range of contexts.

When Base Runs, or any other estimator, evaluates a certain set of data, it intrinsically weights the various events. Because BsR is a multiplicative formula, the intrinsic weights will vary from entity to entity and from context to context. This gives us custom, dynamic linear weights--if we can find the intrinsic weighting used in the estimation of each entity's runs scored.

One way to do this is to take the data that we have for an entity, add a certain number of events to it, recalculate BsR, and then find the difference between our new estimate and the original estimate, and divide by the number of events added to find the value of each event added. That is a mouthful, so I will spell it out more clearly with an example.

Take the famed 1961 Yankees as an example. They had 987 singles, 194 doubles, 40 triples, 240 home runs, 543 walks, and 4098 outs. Plug this into the basic BsR formula:

A = H + W - HR = 1764
B = (2*TB - H - 4*HR + .05*W)*.78 = 1962.597
C = AB - H = 4098
D = HR = 240

Plugging this into BsR, we get 1764*1962.597/(1962.597 + 4098) + 240 = 811.2343368

Now suppose we added 10 singles. A would increase to 1774, B would increase to 1970.397, and C and D would remain the same. Our new BsR estimate would be 816.0144364, a difference of 4.780099617. Since we added 10 singles, each single would be worth .4780099617 runs. That is the LW per single of 10 added singles for the 1961 New York Yankees.

This doesn't truly isolate the value of a single in the Yankees' true context, though, because when we add ten singles, we change the context and we affect all of the other values. The larger the change in context, the further we get from an estimate that relates to the actual context. If we added 1000 singles, for example, we would raise the Yankees' batting average from .263 to .375. This would radically change the context and the estimate of the additional value of each single would have almost no connection to the original context we wanted to evaluate.

Now, the differences for ten singles will probably not be that bad. But if we want more precision, we should add fewer events. So let's add just one single. This is what I and others in the past have used to evaluate linear weights from multiplicative formulas and called the "+1 method". If we add one single to the Yankees, we find a LW value of .477405417.

Adding one single still changes the context, though. So let's add progressively fewer singles and see what happens. If we add .1 singles, the LW value is .477344886. If we add .01, it is .477338832. If we add .00001, it is .477338142. As you can see, the values are changing less and less each time. But we still have not completely isolated the value of a single for the 1961 Yankees, because we are still changing the context, albeit by a very small amount.
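
The add-N-singles procedure is easy to reproduce. Below is a minimal Python sketch using the basic BsR formula and the 1961 Yankees totals given above. The function names are hypothetical. Very small step sizes are left out because the subtraction then starts to pick up floating-point rounding noise.

```python
def basic_bsr(s, d, t, hr, w, outs):
    """Basic BsR from event counts."""
    h = s + d + t + hr
    tb = s + 2 * d + 3 * t + 4 * hr
    a = h + w - hr
    b = (2 * tb - h - 4 * hr + 0.05 * w) * 0.78
    c = outs
    return a * b / (b + c) + hr


def single_value(team, n):
    """Runs per single when n singles are added to the team's totals."""
    s, d, t, hr, w, outs = team
    return (basic_bsr(s + n, d, t, hr, w, outs) - basic_bsr(*team)) / n


yankees_1961 = (987, 194, 40, 240, 543, 4098)  # S, D, T, HR, W, outs
for n in (10, 1, 0.1, 0.01):
    print(n, round(single_value(yankees_1961, n), 6))
```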

What we really want to do is add the smallest amount of singles that we possibly can; we want an infinitesimal number of singles. We want to find the change in LW per event added as we add an event that is almost zero. What we want, mathematically speaking, is the limit of the change in LW, divided by X, as X approaches zero, where X is the number of events we add. This concept is called the derivative in calculus.

Since Base Runs has multiple variables, we need multivariable calculus to find this limit. This is done through a technique called "partial differentiation". I am not a calculus teacher, and so I cannot explain all of the details of how to do this with BsR. What I can do is give you a formula that you can apply.

Let A, B, C, and D be the totals calculated for our entity from the A, B, C, and D formulas, and let a, b, c, and d be the coefficients for each event in the A, B, C, and D formulas we are using (zero if the event is not included). Then the Linear Weight of a given event is equal to:

LW = ((B + C)*(A*b + B*a) - (A*B)*(b + c))/((B + C)^2) + d

When you find the coefficient of each event in each factor, you need to look at the full, expanded equation for each factor. Take A for example. A = H + W - HR. But H = S + D + T + HR. So if you expand it, A = S + D + T + HR + W - HR = S + D + T + W. The coefficient of each of those four events is 1. The HR coefficient is NOT -1; it is 0, because H + W - HR is just an easy way to write what we actually mean, which is S + D + T + W. This can be tricky, so you need to fully expand each factor to find a, b, c, and d for each event.

Anyway, applying this to the 1961 Yankees, we find that the LW value of a single is .477338159 (technical note: the linear weights I am referring to here are absolute linear weights, not the kind that are calculated directly from run expectancy). This is the value that our estimates were converging towards.
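
A short Python sketch of the formula, applied to every event in the basic version for the 1961 Yankees. The function name and the coefficient table layout are hypothetical. The (a, b, c, d) entries come from fully expanding the basic A, B, C, and D factors, so the home run has a = 0 and d = 1.

```python
def intrinsic_lw(A, B, C, a, b, c, d):
    """Partial derivative of A*B/(B + C) + D with respect to one event."""
    return ((B + C) * (A * b + B * a) - (A * B) * (b + c)) / (B + C) ** 2 + d


# 1961 Yankees factor totals under the basic version.
A, B, C = 1764, 1962.597, 4098

# (a, b, c, d) for each event after full expansion of the factors.
coefficients = {
    "S": (1, 0.78, 0, 0),
    "D": (1, 2.34, 0, 0),
    "T": (1, 3.90, 0, 0),
    "HR": (0, 2.34, 0, 1),
    "W": (1, 0.039, 0, 0),
    "O": (0, 0, 1, 0),
}
weights = {event: intrinsic_lw(A, B, C, *abcd)
           for event, abcd in coefficients.items()}

counts = {"S": 987, "D": 194, "T": 40, "HR": 240, "W": 543, "O": 4098}
total = sum(weights[event] * counts[event] for event in counts)
print({event: round(value, 4) for event, value in weights.items()})
print(round(total, 3))
```

Because every factor is a linear combination of the events with no constant term, the weights multiplied by the event counts add back up to the team's BsR estimate, which makes for a handy check on the coefficient table.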

With this formula in hand, we can calculate the linear weight values for any entity with any BsR version. I have provided a spreadsheet to do this with the official offensive statistics (the older BsR article on this site provides a spreadsheet to use with Tango's expanded BsR formula using Retrosheet data). I have already entered the coefficients for my basic version coupled with composite 1961-2004 data (excluding 1981 and 1994). I do not have all of the event frequency information, but you could fill that in for that dataset or any other if you desire.

Using the basic formula from this page on the 1961-2004 data gives these LW for S, D, T, HR, W, O: .475, .805, 1.135, 1.494, .319, -.095. We will use these values later.

There is another very useful application of this concept. As discussed previously, once we have defined what goes in A, C, and D, the B coefficients are in most cases the only ones left to determine. Sometimes, though, you know what Linear Weights you would like to generate for the entity as a whole. If you do, you can find the exact B coefficient that you need to produce it for each event through this formula:

b = ((B + C)^2*(L - d) - B^2*a - B*C*a + A*B*c)/(A*C)

Where L is the Linear Weight value you want to get for the event in question. B here is the Exact B that you calculate from actual runs scored, A, C, and D, as you do not yet have B coefficients for each event and therefore cannot compute B. I have provided a spreadsheet which you can use to do this as well.

Unfortunately, all of the events included in any of the factors must be included in the B factor in order to properly reconcile. This can be cumbersome as you often don't want to include outs or some other event in B, but it is a necessity if you want the precise B coefficients.
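
The b formula can be checked with a round trip: compute the Linear Weight that a known B coefficient produces, then feed that weight back in and confirm the coefficient is recovered. This is only a sketch with hypothetical function names. For simplicity it uses the 1961 Yankees factor totals with the formula B standing in for the Exact B that you would compute from actual runs.

```python
def lw_from_b(A, B, C, a, b, c, d):
    """Linear Weight implied by an event's coefficients."""
    return ((B + C) * (A * b + B * a) - (A * B) * (b + c)) / (B + C) ** 2 + d


def b_for_target_lw(A, B, C, L, a, c, d):
    """B coefficient needed for an event to have Linear Weight L."""
    return ((B + C) ** 2 * (L - d) - B ** 2 * a - B * C * a + A * B * c) / (A * C)


A, B, C = 1764, 1962.597, 4098

# Round trip for the single: a = 1, b = .78, c = 0, d = 0.
target = lw_from_b(A, B, C, a=1, b=0.78, c=0, d=0)
recovered_b = b_for_target_lw(A, B, C, L=target, a=1, c=0, d=0)
print(round(target, 6), round(recovered_b, 6))
```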

Known Limitations of Base Runs

This section is not meant to be comprehensive; it is just a quick discussion of a few of the problems that have been discovered in Base Runs. While it is the author's strong opinion that Base Runs is the most powerful run estimator yet created, because of its applicability across a wide range of contexts, its ease of customization, and its accuracy with regular teams that is comparable to that of any other method, it would be dishonest and unhelpful to pretend that the method is without flaws. These are just a few of the KNOWN issues with Base Runs that may or may not be symptoms of the same underlying problem.

Both of the flaws discussed here were discovered by Tango Tiger. The first was detailed in his three part series on run estimators: Base Runs overestimates the run value of events in the approximate range of .500-.800 OBA. The second flaw is that at certain extreme levels of offense, Base Runs fails to follow the obvious baseball truth that the number of runners left on base in an inning must be capped at 3.

No advocates of Base Runs claim that it is perfect. However, it does have a logical construction that follows known "laws" of baseball. The area where the accuracy of Base Runs could be enhanced is through a better estimator of the score rate. However, a future solution would almost certainly increase the complexity of the formula. B/(B+C) is a very simple but very effective estimator. However, I look forward to the day when some sabermetrician might correct some of the flaws in Base Runs through a more complex score rate estimate. Whatever the future holds for Base Runs, David Smyth should be remembered for providing the first real new advance in run estimators in over a decade.

Monday, March 02, 2020

Tripod: Base Runs

See the first paragraph of this post for an explanation of this series.

Breaking Down BsR

It is sometimes useful to write a stat like Base Runs in rate form. It helps greatly in making the Theoretical Team equations, for one thing, and it is also useful to be able to write BsR completely in terms of BA, OBA, SLG, and HR/PA. To do this, you need to start with each component and divide it by PA. So, A/PA, B/PA, C/PA, and D/PA. (Since I am using a basic version of Base Runs, you need PA=AB+W). You can call these, respectively, Runners On Base Average (ROBA), Advancement Factor (AF), 1-OBA, and HR/PA. Then

BsR/PA = ROBA*AF/(AF+1-OBA)+HR/PA

For the Basic version I use, these are the equations for each component:

ROBA = (H+W-HR)/(AB+W) = OBA-HR/PA
AF = (2*TB-H-4*HR+.05*W)*.78/(AB+W) = ((2*SLG-BA)*(1-OBA)/(1-BA)-4*HR/PA+.05*(OBA-BA)/(1-BA))*.78
1-OBA = 1-(H+W)/(AB+W)
HR/PA = HR/(AB+W)

In the Base Runs article linked above, I gave the equations that I use for each factor in this basic version. The B multiplier is based on the composite MLB stats of 1946-1995. In this period, the averages for each component are:

ROBA   AF     OBA    HR/PA
.303   .308   .325   .0222

You can use these to put together the Theoretical Team factors. The TT concept, which I will not explain here in every detail, is that since Base Runs (or Runs Created) is a run estimator devised for estimating team runs, there is an interactivity between the values of the offensive events. As the offensive production increases, the value of each event goes up (with the exception of the special case, HR). So applying BsR to Babe Ruth gives him an unfair advantage because he is not playing on a team by himself; he is playing on a team with 8 other players. So the TT formula puts the player on a team with 8 average players. We assume that each player on the theoretical team gets the same number of PA as our player. So the team's new A factor can be calculated as (A+LgROBA*PA*8), where A is the individual's A factor. You apply the same technique to the B, C, and D terms, using the long term averages above (you really should have a separate version each year, but small changes in ROBA, AF, etc. don't significantly change the results of the formula).

Then, to see how much the player has helped this team, we compare him to a team of 8 average players in his number of PA each. If we wanted to compare the player to the league average, we would compare him to 9 average players. If you work all this out and simplify, you get this equation for TT BsR, which I like to call Individual Base Runs(IBR).

IBR = (A+2.42PA)(B+2.46PA)/(B+C+7.86PA)+HR-.76PA

Lest it seem as if I am taking credit for coming up with all of this, the pioneering TT work was done by Dave Tate and Bill James, and the application of the TT concept to BsR was also the work of David Smyth.
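
The IBR equation above translates directly into code. Here is a minimal Python sketch, assuming the basic A, B, and C factors and the standard-league constants from the equation. The function names and the batting line are hypothetical, made up purely for illustration.

```python
def basic_factors(ab, h, tb, hr, w):
    """A, B, C, and PA for one batter under the basic BsR version."""
    a = h + w - hr
    b = (2 * tb - h - 4 * hr + 0.05 * w) * 0.78
    c = ab - h
    return a, b, c, ab + w


def individual_base_runs(a, b, c, hr, pa):
    """Theoretical Team BsR (absolute baseline, standard league)."""
    return (a + 2.42 * pa) * (b + 2.46 * pa) / (b + c + 7.86 * pa) + hr - 0.76 * pa


# Hypothetical strong hitter: 550 AB, 180 H, 320 TB, 35 HR, 90 W.
a, b, c, pa = basic_factors(ab=550, h=180, tb=320, hr=35, w=90)
strong_ibr = individual_base_runs(a, b, c, 35, pa)

# A batter with exactly the long-term league rates over the same PA.
average_ibr = individual_base_runs(0.303 * pa, 0.308 * pa, 0.675 * pa,
                                   0.0222 * pa, pa)
print(round(strong_ibr, 1), round(average_ibr, 1))
```

The league-rate batter comes out close to the league BsR/PA of about .117 times his PA. The match is not exact because the constants in the equation are rounded.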

Stolen Base BsR

It is useful and necessary to get some more categories into a Runs Created formula, and so here we'll put SB and CS in (this is again based on Smyth's work). The other categories we could add, like SF, SH, and DP, I choose to ignore. For one, they are very situation dependent and therefore I'm not 100% comfortable including them in an individual formula, and secondly and more importantly, I am lazy and don't want to deal with them. Anyway, for BsR including SB:
A = H + W - HR - CS
B = (2*TB - H - 4*HR + .05*W +1.5*SB)*.76
C = AB-H

The IBR formula for the standard league is:

IBR =(A+2.34PA)(B+2.58PA)/(B+C+7.98PA)+HR-.76PA

ROBA and AF are no longer the rate stats; I call these AROBA and AAF for "advanced". Anyway, the long term averages are:
AROBA   AAF    OBA    HR/PA
.293    .323   .325   .0222

Full BsR

Here is a version of the BsR formula that you can use if you have all of the minor (SH, SF, DP, etc.) offensive stats. It is not as clean and nice looking as the other versions on this page, but there needs to be more of a give-and-take between the various events when you include the other stats. It is also not straightforward as to which events should be placed in which factor(s). I took the convention that A is final baserunners; baserunners less those who we know have been thrown out on the bases or taken out on a DP. Everything goes in B to balance everything out and produce good linear weights, while C is batting outs. D remains home runs. There are other ways to define these terms and Smyth, TangoTiger, and Robert Dudek have all done these in different ways than I have. There are certainly arguments to be made for all of the different approaches, but a discussion of that will have to wait for another day.

A = H + W + HB - HR - CS - DP
B = .777S + 2.61D + 4.29T + 2.43HR + .03(W + HB - IW) - .747IW + 1.30SB + .13CS + 1.08SH + 1.81SF + .70DP -.04(AB-H)
C = AB - H + SH + SF

If you want to include strikeouts, they go in this B factor which is coupled with the A and C factors given above: B = .781S + 2.61D + 4.28T + 2.42HR + .034(W + HB - IW) - .741IW + 1.29SB + .125CS + 1.07SH + 1.81SF +.69DP - .029(AB-H-K) - .086K

Finding the B Multiplier

The B multiplier is designed so that the BsR formula will produce the correct number of runs for the entity you are using. This is because A as baserunners, C as outs, and D as Home Runs, all are straightforward and obvious formulas.

You can calculate, based on A, C, and D, the actual B factor required to equate BsR with R, by this formula: (R-D)*C/(A-R+D). What can you do with the actual B value? For one thing, if you already have a set formula for B(ignoring the multiplier), you can divide actual B by estimated B to get the correct multiplier. Another thing you can do is run a regression to find weights for TB, H, etc. by using those stats to predict Actual B, or use other approaches like trial and error, etc. All of these approaches had a role in finding the B component used in the official versions of BsR.

An alternate way to find B is to calculate Z=(R-D)/A, then B=Z*C/(1-Z). It is the same thing, and longer and more complicated, but it is equivalent. (I include it because it was the way I did it until I took the time to work out the algebra to derive the other formula).

Building the TT BsR Formula

Here are the technical steps for building the TT formula. These are not very interesting for most people, but hard core sabermetricians may find them useful (although hard core sabermetricians probably already know how to do it themselves):

IBR can be written as:
(A+X*PA)(B+Y*PA)/((B+Y*PA)+(C+Z*PA))+HR+T(PA)-(V)PA which simplifies to:
(A+X*PA)(B+Y*PA)/(B+C+(Y+Z)PA)+HR-(V-T)PA
where X is the remainder of team ROBA
Y is the remainder of team AF
Z is the remainder of team 1-OBA
T is the remainder of team HRPA
V is the R/PA for the comparison lineup multiplied by the number of players in the comparison lineup

OK, since we always add the player to a team with 8 average players:
X = LgROBA*8
Y = LgAF*8
Z = (1-LgOBA)*8
T = LgHR/PA*8

Depending on what baseline we use though, V will vary. For absolute runs, we compare the player to a team with 8 average hitters. For runs above average, we compare the player to a team with 9 average hitters. For runs above replacement, we compare the player to a team with 8 average hitters plus one replacement level hitter. So, it is very straightforward to find V for absolute: 8*LgBsR/PA. For average, V = 9*LgBsR/PA.

For replacement, we need to first set a replacement level, and then determine what ROBA, AF, OBA, and HRPA a replacement player will have. I assume 25 batting outs (AB-H) per game, and use BsR/PA to calculate the R/G for the league: (BsR/PA)/(1-OBA)*25, since BsR/O = (BsR/PA)/(1-OBA). (Keep in mind that BsR/PA = ROBA*AF/(AF+1-OBA) + HRPA). Then, I assume the replacement rate is 1 run/game below average, so I take that R/G, subtract 1, and divide by 25. This is the replacement player's R/O. In the standard league we are using, the BsR/PA = .117, R/O = .173, and RepR/O (R/O for the replacement) = .133. Then we need to find the value, X, by which each component stat for the league (ROBA, AF, OBA, and HRPA) needs to be deflated for R/O to equal .133. We multiply each term in the BsR/O formula by X. This, when simplified, gives this equation:
(R*A*X^2/(1+X*(A-O))+H*X)/(1-O*X) = RepR/O

R is LgROBA, A is LgAF, O is LgOBA, and H is LgHRPA. I have no idea how to solve for X by hand, but my TI-83 calculator will do it, and it gives .89 for the standard league (this will all vary based on the league offensive levels, and of course how you personally choose to define replacement rate). Anyway, we then multiply each component by .89 to find what we expect our replacement to hit:
ROBA   AF     OBA    HRPA
.269   .274   .289   .02

So this gives him a BsR/PA of .095. We then calculate the V value for the replacement baseline as 8*LgBsR/PA+RepBsR/PA. Here is a chart showing the values you need to fill in for the TT components at each baseline in the standard league:
BASELINE      X      Y      Z      T      V       V-T
Absolute      2.42   2.46   5.40   .178   .937    .759
Average       2.42   2.46   5.40   .178   1.054   .876
Replacement   2.42   2.46   5.40   .178   1.031   .853
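
For anyone without a TI-83, the replacement deflator and the chart values can be rebuilt in a few lines. This Python sketch takes the standard-league averages from above, deflates every league component by the same factor inside BsR/O, and solves for that factor by bisection. Bisection is a substitute for the calculator's solver, not the original method. All names are hypothetical, and small differences from the chart are expected from rounding.

```python
LG_ROBA, LG_AF, LG_OBA, LG_HRPA = 0.303, 0.308, 0.325, 0.0222


def bsr_per_pa(x=1.0):
    """League BsR/PA with every component multiplied by x."""
    roba, af, oba, hrpa = (x * v for v in (LG_ROBA, LG_AF, LG_OBA, LG_HRPA))
    return roba * af / (af + 1 - oba) + hrpa


def bsr_per_out(x=1.0):
    return bsr_per_pa(x) / (1 - x * LG_OBA)


# Replacement level: one run per 25-out game below the league rate.
rep_r_per_out = (bsr_per_out() * 25 - 1) / 25

# Bisection: bsr_per_out rises with x on this interval.
lo, hi = 0.5, 1.0
for _ in range(60):
    mid = (lo + hi) / 2
    if bsr_per_out(mid) < rep_r_per_out:
        lo = mid
    else:
        hi = mid
deflator = (lo + hi) / 2

# TT components: the player joins 8 league-average hitters.
x_term, y_term = 8 * LG_ROBA, 8 * LG_AF
z_term, t_term = 8 * (1 - LG_OBA), 8 * LG_HRPA
v_terms = {
    "Absolute": 8 * bsr_per_pa(),
    "Average": 9 * bsr_per_pa(),
    "Replacement": 8 * bsr_per_pa() + bsr_per_pa(deflator),
}
print(round(deflator, 3))
for baseline, v in v_terms.items():
    print(f"{baseline:12s} {x_term:.2f} {y_term:.2f} {z_term:.2f} "
          f"{t_term:.3f} {v:.3f} {v - t_term:.3f}")
```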

If you want to get more complex, there is something that we have failed to address. That is that if you really add a player to a team, he will change the number of PA everyone in the lineup gets. A player with a higher OBA than his teammates will generate more PA; one with a lower one will generate fewer. In the TT formula above, we have held PA constant. What if we let them vary? We can calculate the OBA the team would have with the player as 8/9*LgOBA+1/9*OBA. Call this Q. Then, figure (1-LgOBA)/(1-Q). Call this PAR, or PA-added ratio. Then, multiply every individual term (the new A, the new B, the new C, and the new D) by PAR, and proceed as usual.
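
A minimal Python sketch of the PAR calculation, using the standard-league OBA of .325. The function names and the sample OBA values are hypothetical. The scaling helper reflects one reading of "multiply every individual term by PAR", namely that each of the four team-level factors is multiplied by the same ratio before the usual calculation continues.

```python
def pa_added_ratio(player_oba, lg_oba):
    """PAR: how much the lineup's PA total changes with this player in it."""
    q = 8 / 9 * lg_oba + 1 / 9 * player_oba  # team OBA including the player
    return (1 - lg_oba) / (1 - q)


def scale_team_factors(team_a, team_b, team_c, team_d, par):
    """Apply PAR to the team's new A, B, C, and D factors (assumed reading)."""
    return tuple(par * term for term in (team_a, team_b, team_c, team_d))


LG_OBA = 0.325
par_average = pa_added_ratio(0.325, LG_OBA)  # league-average OBA
par_high = pa_added_ratio(0.425, LG_OBA)     # hypothetical high-OBA hitter
par_low = pa_added_ratio(0.275, LG_OBA)      # hypothetical low-OBA hitter
print(round(par_average, 4), round(par_high, 4), round(par_low, 4))
```

Since A, B, C, and D are all scaled by the same ratio, the team's BsR estimate scales by exactly PAR as well, so the adjustment amounts to a small percentage bump or cut in team runs before the baseline is subtracted.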

Is this worth it? Who knows. Some of these bells and whistles might wash out when you convert them to win values. Maybe they don't. A straight linear system, though, might be correct, and it will help you keep your sanity.

Fundamental Structure of BsR

The fundamental structure of BsR is its key asset. That fundamental structure is based on the simple, undeniable truth that runs scored = baserunners*% of baserunners who score + home runs. "Baserunners" does not include home runs. Anyway, in BsR, the A factor represents baserunners and the D factor represents home runs. The % of baserunners who score, which we'll call score rate, is estimated as B/(B+C), where B is advancement and C is outs.

Other run estimators are not backed up by a fundamental theory of how runs are scored. Runs Created's downfall is its failure to account for the unique nature of the HR (that it always produces at least one run, and if it occurs by itself, it will produce only one run). Static LW formulas fail to account for the fact that the value of each event varies based on the context. BsR is based on a true equation of how runs are scored. That does not mean, though, that BsR is the one true correct run estimator by any stretch. The equation of B/(B+C) to estimate score rate has good empirical accuracy, but also has been found to not work very well in some circumstances (such as OBA between .500 and .800--see Tango's article on Primer about this). Maybe score rate should be estimated in a totally different way. But the structure of the BsR equation is sound. If we want a better run estimator, we need a better estimator of score rate.

Linear BsR

You can figure how a non-linear RC formula values each event in the context you are interested in(it can be the league, a specific team, or even a hypothetical lineup of the same player over and over again). All you have to do is calculate BsR for the entity, and then add one single, recompute BsR, and subtract the first figure. This is the value of one additional single. Then you do the same with every other event, and you'll have LBsR. You have to be careful to account for everywhere the event is involved; for example, a single not only adds a hit but also a Total Base and an At Bat. If you run the LBsR for the long-term stats, you get these values:

LBsR = .48S+.81D+1.14T+1.50HR+.32W-.096(AB-H)
LBsR(sb) = .47S+.77D+1.07T+1.45HR+.33W+.23SB-.41CS-.093(AB-H)

Of course, you could add something other than one. You could subtract one, or add 10, or add 15. The further you get away from 0, the more the results will vary. Adding 1000 singles will have a much different effect, even per single, than adding 1 single. Really, as Tango has pointed out, we want to get as close to adding 0 singles as possible. Adding .00001 singles changes the run environment and the values of the other events very little, and that is what we are looking to do. It is sort of like a limit in calculus. Actually, I guess that's exactly what it is. We want to find the limit of (new BsR minus old BsR) divided by X, as X approaches 0, where X is the number of the event that we are adding. Somebody who knows a lot about calculus could probably tell me if I'm right about that, and if so, come up with a formula to calculate the limit precisely instead of having to do trial and error in a spreadsheet.

I have included a spreadsheet which runs through this approach for the 1979 Pirates. You can change the data in cells B2 to G2 to whatever you want to do this with other entities. Anyway, I show the LW generated by adding 10 of each event, 1 of each event, .1 of each event, etc. and the same for -10, -1, -.1, etc. I have highlighted in pink the positive and negative points at which the convergence, the limit, occurs. If you go past that(I put it at one ten-millionth, 10^-6), the values start fluctuating again. My suspicion is that this is because of the spreadsheet not having perfect accuracy, internal rounding and the like, but I could be wrong. Anyway, you can see there is not a lot of difference. The +10 weight for a Pirate single for instance is .4898824, the +1 is .4892998, and the limit is around .4892350. So you really don't need to do that, but it is nice to illustrate the property.

Added 4/7/04: Using calculus, you can figure this precisely using partial derivatives. The value of the single for instance is equal to the partial derivative of the BsR function with respect to singles. You can still do this even if you don't know calculus, because the math works out simple with BsR. The formula winds up being:

((B+C)*(A*b+B*a)-(A*B)*(b+c))/((B+C)^2)+d

Let A, B, C, and D be the respective total factors for the entity you are interested in. Let a, b, c, and d be the A, B, C, and D coefficients of the event you are interested in. That's it. Thank goodness all of the formulas for the pieces of BsR are linear.
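
The formula is just the quotient rule applied to A*B/(B+C), plus the event's D coefficient. Here is a minimal Python sketch of it, again assuming the simple BsR version used later in this post and using invented team totals; factors, linear_weight, and WEIGHTS are hypothetical names.

```python
# (a, b, c, d) coefficients per event; assumed simple BsR version.
COEF = {
    "S":  (1, 0.78,  0, 0),
    "D":  (1, 2.34,  0, 0),
    "T":  (1, 3.90,  0, 0),
    "HR": (0, 2.34,  0, 1),
    "W":  (1, 0.039, 0, 0),
    "O":  (0, 0.0,   1, 0),  # O = AB - H
}

# Made-up team totals, for illustration only.
EXAMPLE = {"S": 1000, "D": 250, "T": 40, "HR": 150, "W": 500, "O": 4000}


def factors(counts):
    """Total A, B, C, D factors for the entity."""
    return tuple(
        sum(n * COEF[event][i] for event, n in counts.items())
        for i in range(4)
    )


def linear_weight(A, B, C, a, b, c, d):
    """Partial derivative of A*B/(B+C) + D with respect to one event."""
    return ((B + C) * (A * b + B * a) - (A * B) * (b + c)) / (B + C) ** 2 + d


A, B, C, D = factors(EXAMPLE)
WEIGHTS = {event: linear_weight(A, B, C, *COEF[event]) for event in COEF}

for event, weight in WEIGHTS.items():
    print(f"{event:>2}: {weight:+.5f}")
```

Note that the total D factor never appears in the derivative; only the event's own d coefficient does.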

There is a spreadsheet linked at the bottom that shows this. It is based on Tango's full BsR which is available at the link.

If you don't want to deal with a category, just set the coefficients to 0. You can change the coefficients for the other events to use any BsR equation you want all with this spreadsheet. Of course you can also change the "#" column, which is the frequency of the event for the dataset you're using. Enjoy.

Matching LW Values

Based on the formula above to calculate the Linear Weight value of a certain event using BsR, you can also fix the B coefficients so that they produce desired LWs. For example, on my LW page there is the ERP formula that I use, based on 1951-1998 composite major league data. Suppose I want to force my BsR formula to produce the same LW as are used in ERP. How do I go about doing this?

Well, first, I have to clearly define which events are in the A, C, and D factors, and what coefficient they have there. For my case, I will use S, D, T, HR, W, and O as the only events. S, D, T, and W each have a coefficient of 1 in A; O has a coefficient of 1 in C; and HR has a coefficient of 1 in D.

Now, I need to calculate the A, C, and D factors for the entity I am working with (in my case, all teams 1951-1998). Then, I use these to calculate what we will call ActB--the actual B value required for BsR to equal runs scored. The formula for ActB is (R-D)*C/(A-R+D), where R is the actual runs scored we want to match.
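
The ActB formula comes from setting R = A*B/(B+C) + D and solving for B: (R-D)*(B+C) = A*B, so (R-D)*C = B*(A-R+D). A minimal Python sketch, with invented totals standing in for the real 1951-1998 data and actual_b as a hypothetical name:

```python
def actual_b(A, C, D, R):
    """B value that makes A*B/(B+C) + D equal R exactly."""
    return (R - D) * C / (A - R + D)


# Made-up totals, for illustration only.
A, C, D, R = 1790, 4000, 150, 700

act_b = actual_b(A, C, D, R)
print("ActB:", act_b)
print("BsR using ActB:", A * act_b / (act_b + C) + D)  # should reproduce R
```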

So, now we have everything we need. a, c, and d are still the coefficients for the given event in the respective factors, B is the ActB figure from above, and L is the linear weight we want the event to have. Then we can calculate b as:

b = ((B+C)^2*(L-d)-B^2*a-B*C*a+A*B*c)/(A*C)
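
Here is a minimal Python sketch of this step. For each event it solves for the B coefficient that produces a target linear weight L, then plugs that coefficient back into the partial derivative formula as a check. The targets are the ERP weights discussed in this section; the A, B, and C totals are invented, and forced_b, linear_weight, EVENTS, and FORCED are hypothetical names.

```python
# Per event: (a, c, d, target linear weight L). Targets are the ERP weights.
EVENTS = {
    "S":  (1, 0, 0, 0.486),
    "D":  (1, 0, 0, 0.81),
    "T":  (1, 0, 0, 1.134),
    "HR": (0, 0, 1, 1.458),
    "W":  (1, 0, 0, 0.324),
    "O":  (0, 1, 0, -0.0972),
}

# Made-up totals, for illustration only; B plays the role of ActB.
A, B, C = 1790.0, 1774.2, 4000.0


def forced_b(A, B, C, L, a, c, d):
    """B coefficient that makes the event's linear weight equal L."""
    return ((B + C) ** 2 * (L - d) - B ** 2 * a - B * C * a + A * B * c) / (A * C)


def linear_weight(A, B, C, a, b, c, d):
    """Partial derivative of A*B/(B+C) + D with respect to one event."""
    return ((B + C) * (A * b + B * a) - (A * B) * (b + c)) / (B + C) ** 2 + d


FORCED = {
    event: forced_b(A, B, C, L, a, c, d)
    for event, (a, c, d, L) in EVENTS.items()
}

for event, (a, c, d, L) in EVENTS.items():
    b = FORCED[event]
    print(f"{event:>2}: b = {b:.4f}, LW check = {linear_weight(A, B, C, a, b, c, d):.4f}")
```

In this sketch the check comes back exact (up to floating-point error) because the total B is held fixed. If the forced coefficients are then applied to the event counts to recompute B, the new total need not equal ActB, which is one possible reason a real application only matches approximately.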

Voila. So, let's look at my ERP equation. It is (TB+W+.5H-.3(AB-H))*.324, which has LW values for S, D, T, HR, W, O of .486, .81, 1.134, 1.458, .324, -.0972. The B that I use for BsR, (2TB-H-4HR+.05W)*.78, is:

B = .78S+2.34D+3.9T+2.34HR+.039W

Now, with all of this data, we can force the LW values. When we do this (which you can do with the spreadsheet linked at the bottom of the page, the same one that gives the actual LW values), it seems to give a result that's decent to .001 or so. It might be rounding error, or it might be something else, but either way, it's pretty close. So, to match the linear weight values I wanted, my B would be:

B = .833S+2.360D+3.888T+2.159HR+.0692W-.010(O)

Yes, the outs have to be included as well. That's kind of cumbersome if you don't want outs in B, but it's necessary to force the values. Are you sufficiently confused yet? I am.

1979 Pirates

Full BsR LW
