Walk Like a Sabermetrician: Tripod: Base Runs II

See the first paragraph of this post for an explanation of this series.

This is the second page on Base Runs that I have written up for this site. It makes no attempt to cover new concepts that weren't addressed in the original page. What it does try to do is write-up the information from the first page in a more accessible way. I have seen comments that the Base Runs page on this site is hard to understand. Unfortunately, the main cause of this problem is probably my writing style and skill (or more appropriately lack thereof). However, it is true that the original page was created by adding on new concepts as time passed, and therefore is somewhat of a hodge-podge of different ideas, written at different times, without a comprehensive master plan in mind. This page will attempt to address this.

Philosophy and Origins of Base Runs

Base Runs is a run estimator developed in the early 1990s by David Smyth. Like Runs Created, BsR is designed to estimate the number of runs that a team would score. Methods of this type attempt to incorporate the interactive effect of offensive events. A linear weights formula like Extrapolated Runs or Estimated Runs Produced (or even, essentially, Clay Davenport's Equivalent Runs) applies a static run value to each event. Usually these formulas weight walk at around 1/3 of a run. And in most circumstances, this is a good estimate of the number of runs that will result from a walk. But in a game in which a team draws a walk and makes 27 outs, the walk will not have the same value. In fact, since estimators like ERP apply a value of about -1/10 of a run for every out, it will predict somewhere in the neighborhood of -2.4 runs for that game. This answer is obviously wrong.

The reason why this is that linear formulas are designed to work with a certain range of data that corresponds to the range in which normal major league teams perform. When you apply the method in these contexts, it will give very accurate estimates. But when you attempt to take the method outside of the context in which it was developed for, problems will result. None of this is meant to put down linear formulas which are very useful in sabermetrics. It only stands to illustrate the much more difficult task that BsR or RC attempt to perform. Ideally, they should be models of run scoring that work over a wide range of contexts and can give an accurate estimate for unusual or extreme situations. Another way to look at this is that Base Runs generates custom linear weights that are intrinsically generated and then applied in all situations.

Unfortunately, Runs Created does a very poor job of estimating in extreme contexts--in fact, in many cases poorer then linear methods! The reason for this is that while RC is constructed based on reasonable principles of how an offense it works, it does not recognize certain constraints on the number of runs that will be scored.

For example, a home run will always produce at least one run. It does not matter if every other batter has made an out, the team will get a run if they go deep. The Basic RC formula of (H+W)*TB/(AB+W) would predict (1+0)*4/(28) = .14 runs for a team that hit a homer and made 27 outs. But we know that they must score at least one run.

Furthermore, if all you do is hit home runs, each home run will produce just one run. Suppose a team entered the bottom of the ninth trailing by two runs, and the first two batters hit home runs. RC would predict (2+0)*8/2 = 8 runs, or 4 from each home run. This is another impossibility.

Another "known point" is the case of all outs. You will score zero runs if all you do is make outs, but you cannot wind up with negative runs. RC correctly predicts zero runs, but all linear methods must predict negative runs below a certain level of production in order to have any accuracy in normal levels.

Base Runs gives much more reasonable estimates in these extreme circumstances. This is because it starts with a true model of how runs are scored. Each batter that comes to the plate will eventually do one of three things: make a batting out, hit a home run, or reach base. Once he has reached base, there are three more potential outcomes: he will score, make an out on the bases, or be left on base at the end of the innings. Simplifying further, an identity for the number of runs scored can be written as Baserunners * % of base runners who score + Home Runs. This is an undeniably true statement. BsR uses this model to derive an estimate of runs scored.

Although the identity is undeniably true, the estimates that the formula uses are not. If we are given a team's offensive statistics but not their runs scored, we can never know for sure what percentage of baserunners will score--if we knew this, we would have a method with 100% accuracy. We do know for sure the number of home runs, and we do have a very good estimate of the number of baserunners (but we don't know, for instance, how many runners will be retired stretching doubles into triples). It is the percentage of baserunners who score that involves an estimate that is not assured of being almost 100% correct, and therefore this component is a crucial determinant of the accuracy of the estimate.

Smyth broke his formula into four factors denoted as A, B, C, and D. A is simply the number of baserunners. D is simply the number of home runs. B is the "advancement factor", representing the advance of baserunners towards scoring. C is the number of outs. B/(B + C) serves as the estimate of the number of runs that will score. Putting it all together, the construct for BsR is:

BsR = A*B/(B + C) + D

Or if its easier for you to see this way:

BsR = A*(B/(B + C)) + D

An important note here is that the use of B/(B+C) is not an inevitable one. Any formula that accurately estimates the percentage of baserunners that will score could be used. However, the basic B/(B+C) model developed by Smyth is the most accurate currently known. It may well be possible to improve the accuracy, but it would probably involve a much more confusing or expansive formula. The important point is that B/(B+C) is used because it has been empirically shown to work.

Other run estimators have incorporated the idea of Runs = baserunners*% who score + HR, such as Eric Van's Contextual Runs. Van modeled the scoring percentage as B/C, where B was advancement (although with radically different weights) and C was outs. Using this ratio results in poorer results when the number of outs is low, though. But BsR potentially could be improved if a more accurate model of the percentage of baserunners who score was found.

Base Runs formulas

Many different formulas for Base Runs have been created and used. This has led to some confusion about what the "true" or "official" formula was. One of the great beauties of BsR is that it is very flexible and the basic construct can lead to many different versions. But in the interest of alleviating some of the confusion, Smyth published three versions, each designed to work with different datasets. The most basic of these is:

A = H + W - HR
B = (1.4*TB - .6*H - 3*HR + .1*W)*1.02
C = AB - H

Another version included all of the offensive events (contained in the official statistics) with the exception of sacrifices:

A = H + W + HB - HR - .5*IW
B = (1.4*TB - .6*H - 3*HR + .1*(W + HB - IW) + .9*(SB - CS - DP))*1.1
C = AB - H + CS + DP

Finally, a version applicable with official pitching statistics:

A = H + W - HR
B = (1.4*TBe - .6*H - 3*HR + .1*W)*1.1
C = 3*IP
Where TBe = 1.12*H + 4*HR

Another important version of the formula was published by Tango Tiger. He developed it from the use of 1974-1990 play-by-play data from Retrosheet, so it includes many categories that aren't included in the official statistics. It is best that you read Tango's explanation of this formula if you are interested, so please visit his article.

Applying BsR

There are many different ways to apply BsR, and this section does not purport to examine all of the possibilities. There are some basic principles to lay out, though. The key is that BsR should NOT be applied to individual hitters. Base Runs models the run scoring of a team. Individual players do not act as entire teams--they act as one part out of nine in a team. Barry Bonds' walks do not interact directly with his home runs--they interact with the home runs of his entire team. So it is wrong to apply BsR to individual hitters, just as it is to apply RC to individual hitters. It is true that applying BsR to individual hitters will often result in a decent estimate and will do better then RC (because the flaws of RC combine with the incorrect application to produce even worse results), but it is not recommended. In general, it will serve to overrate good offensive player's run production and underestimate for bad players.

But by the same logic, you should apply Base Runs to individual pitchers, because when a pitcher is in the game, his performance interacts with no other pitchers. He is the lone pitcher for his team, and he dramatically affects the run environment he pitches in--in fact, he alone determines it (to the extent that a pitcher can). Linear Weight formulas, as discussed earlier, are designed to work in normal major league team contexts. Replacing an average hitter with even an extreme hitter like Bonds does change the run environment, but generally not enough to severely impact the accuracy of the LW estimate. This is not at all true for pitchers. A team that hit all the time as the average batter does against Johan Santana would be laughed out of the league and would fall outside of the range of best accuracy for LW formulas. Base Runs attempts and does a good job at adapting to these extreme circumstances and should be applied to pitchers and teams, but not individual batters.

Versions of BsR Used on This Site

On this site, I use three BsR formulas: one that incorporates only the basic offensive events, one that incorporates SB and CS, and a third that incorporates all of the official offensive categories. I do not claim these formulas to be more accurate or "better" then others--in fact, they are probably less accurate then other formulas. However, they still are very accurate at estimating runs scored and can be used without too much concern. I have used them for the examples of other concepts involving BsR on this page and in the accuracy test published here.

While the A, B, C, and D factors all have straightforward definitions, this does not make the choice of which events to put in them inevitable. David Smyth, Tango Tiger, Robert Dudek, myself, and possibly others have developed BsR versions and have used different philosophies to guide what to include in each factor. For instance, Smyth once published a version with D = HR + SF, since like HR, SF are guaranteed runs. Another common quandary is whether CS should be a loss of a baserunner, an additional out, or both. As we will see later, there are also advantages to giving each event a B value, even if it has been included in the other factors.

In the versions presented here, I have used the following thinking to guide my choices. I don't claim these choices as the correct or best choices, but to me, they are the most logical and easiest to work with.

The A factor represents "final" baserunners. What I mean by this is that it is number of baserunners that, as far as we can tell from the official statistics, were not retired once they reach base. So, in versions that utilize those stats, caught stealings and double plays are removed because those runners are known to have been out.

The B factor, which represents advancement as always, includes all events with the exception of outs in the first two versions. However, the "full" version that incorporates all of the official offensive categories puts every event in the B factor, as this greatly helps to balance the formula and makes it much easier to construct.

The C factor includes batting outs; outs made by BATTERS. So CS and DP are not batting outs; the baserunner was caught stealing and the fact that the batter was retired on the double play was already accounted for in his AB-H total. But SH and SF are batting outs.

The D factor is home runs, always. While it is true that we know for each SF a run will score, I consider this an accident of the official statistics and not a fundamental facto of baseball. For instance, we could also easily have an official statistic for "RBI Groundouts". But we do not. And suppose the statistics broke down each hit type into "RBI Singles" or "Non-RBI Triples", etc. If we put each of these events into D, we would eventually wind up with a formula that eventually just said that Runs = Runs. For this reason, I do not consider SF as a "guaranteed run" under the BsR definition. Maybe you could define D as "guaranteed runs created without the use of a baserunner", since the batter who hits a home run does not become a baserunner and the SF requires a runner on third base to result in a run.

Based on these underpinnings, here are the formulas used on this site for Base Runs (I should point out that the basic and SB versions were actually originally published by David Smyth a few years ago, but I have continued to employ them):

BASIC

A = H + W - HR
B = (2*TB - H - 4*HR + .05*W)*.78 = .78*S + 2.34*D + 3.9*T + 2.34*HR + .039*W
C = AB - H

STOLEN BASE

A = H + W - HR - CS
B = (2*TB - H - 4*HR + .05*W + 1.5*SB)*.76 = .76*S + 2.28*D + 3.8*T + 2.28*HR + .038*W + 1.14*SB
C = AB - H

FULL

A = H + W + HB - HR - CS - DP
B = .777*S + 2.61*D + 4.29*T + 2.43*HR + .03*(W + HB - IW) - .747*IW + 1.30*SB + .13*CS + 1.08*SH + 1.81*SF + .70*DP - .04*(AB - H)
C = AB - H + SH + SF

An alternate B factor incorporated a different value for strikeouts then other outs:

B = .781*S + 2.61*D + 4.28*T + 2.42*HR + .034*(W + HB - IW) - .741*IW + 1.29*SB + .125*CS + 1.07*SH + 1.81*SF + .69*DP - .029*(AB - H) - .086*K

Determining the B Factor

Since the B factor is where the most estimation is involved (in fact, if you follow a strict definition of the factors as I did above, it is the only place where you have any choices t make in developing a formula), it is often possible to improve accuracy by tweaking it. Also, if one wishes to perform a regression equation to find B coefficients, he would need to know the actual B value necessary to equal runs scored for the entity (team or league generally, but an individual player or any combination of baseball data could be considered an entity as well) in question. Here are two equivalent methods to determine what I will call ActB, the actual B factor.

The first is just to do algebra to rearrange the formula R = A*B/(B + C) + D to solve for B. You wind up with B = (R - D)*C/(A - R + D). A second way is to determine the actual percentage of baserunners that score, which I'll denote as Z. Z = (R - D)/A, which leads to B = Z*C/(1 - Z).

To adjust the B factor of a given formula, just find the value of your formula B for the entity in question and call it EstB. Then ActB/EstB is multiplied by the B coefficients you have, and then you wind up with a new B formula for your entity.

Accuracy of BsR

Various questions have been raised about the accuracy of BsR. Some people have claimed that since Base Runs purports to be accurate in extreme contexts, it must necessarily give up accuracy with normal teams. Other people are caught in the "accuracy trap"--they claim that the best run estimator is the one with the lowest Root Mean Square Error (RMSE) when applied to normal team data.

I will address the later viewpoint first. Almost by definition, the highest accuracy in terms of RMSE will come using a linear multiple regression equation for runs. However, regression is a purely statistical tool and does not consider the fundamental facts of baseball as BsR does, or even to a lesser extent the human developers of other run estimators have. Related to this, regression equations are tailored specifically to idiosyncrasies within their dataset and will not hold up when applied to a different dataset (although a larger sample size does help). While regression equations can be useful, very few people who are in the camp of "lowest RMSE" advocate using regression equations. This causes me to question their true adherence to this belief. Methods like Extrapolated Runs attempt to blend results from regression equations, skeletons, and empirical linear weights (see the "Linear Weights" article on this site), often sacrificing theoretical accuracy for results. A hybrid method like XR does test with greater accuracy then BsR in general, but at what cost?

Other run estimators simply cannot be trusted in their estimations in extreme contexts. Base Runs has its flaws too, but is generally a much better estimator across the entire spectrum of production. Methods like XR may be more accurate, slightly, on normal teams, but are far, far less accurate on extreme teams. This may be a trade that you are willing to make, depending on your needs, but it is not an inevitable one to every sabermetrician.

As to the claim that Base Runs does not have comparable accuracy when applied to regular teams as other run estimators, this is simply not true. The Stolen Base version of BsR presented above has a lower RMSE when applied to 1961-2004 data (excluding the strike-shortened seasons of 1981 and 1994) then does Stolen Base RC, ERP, Equivalent Runs, or Ugly Weights. The only methods which beat it in the test, which can be seen on the "Accuracy" page on this site, were a regression equation based on those teams, and XR. Base Runs' accuracy with actual teams is comparable to any of the other run estimators that have been published, and in many cases, better.

Writing Base Runs as a Rate

It is often helpful to be able to write Base Runs or other run estimators in terms of rates rather then raw numbers. The easiest way to do this is to calculate BsR/PA. To do this, simply divide the A, B, C, and D factors by plate appearances (figure PA using whatever data is used in the specific BsR equation you are using). I call A/PA "Runners On Base Average" (ROBA), B/PA "Advancement Factor" (AF), C/PA "Out Average" (OA), and D/PA simply as Home Runs per Plate Appearance (HRPA). Then BsR/PA is very simple:

BsR/PA = ROBA*AF/(AF + OA) + HRPA

One advantage of the Basic BsR employed on this page is that it is written without knowing singles, doubles, and triples specifically, just hits, total bases, and home runs. This may not result in the most precise equation, but it does allow the rate stats above to be written in terms of BA, OBA, SLG, and HRPA (where OBA is just (H+W)/(AB+W)):

ROBA = OBA - HRPA
AF = ((2*SLG - BA)*(1 - OBA)/(1 - BA) - 4*HRPA + .05*(OBA - BA)/(1 - BA))*.78
OA = 1 - OBA

Linear BsR

Base Runs, as already discussed, is a multiplicative formula. However, there are many advantages to Linear Weight formulas, including their ability to be used as a measure of individual hitter performance. Since Base Runs is an accurate estimator of run scoring across a wide range of contexts, we can use it to estimate linear weight values across a similarly wide range of contexts.

When Base Runs or any other estimator, evaluates a certain set of data, it intrinsically weights the various events. Because BsR is a multiplicative formula, the intrinsic weights will vary from entity to entity and from context to context. This gives us custom, dynamic linear weights--if we can find the intrinsic weighting used in the estimation of each entity's runs scored.

One way to do this is to take the data that we have for an entity, add a certain number of events to it, recalculate BsR, and then find the difference between our new estimate and the original estimate, and divide by the number of events added to find the value of each event added. That is a mouthful, so I will spell it out more clearly with an example.

Take the famed 1961 Yankees as an example. They had 987 singles, 194 doubles, 40 triples, 240 home runs, 543 walks, and 4098 outs. Plug this into the basic BsR formula:

A = H + W - HR = 1764
B = (2*TB - H - 4*HR + .05*W)*.78 = 1962.597
C = AB - H = 4098
D = HR = 240

Plugging this into BsR, we get 1764*1962.597/(1962.597 + 4098) + 240 = 811.2343368

Now suppose we added 10 singles. A would increase to 1774, B would increase to 1970.397, and C and D would remain the same. Our new BsR estimate would be 816.0144364, a difference of 4.780099617. Since we added 10 singles, each single would be worth .4780099617 runs. That is the LW per single of 10 added singles for the 1961 New York Yankees.

This doesn't truly isolate the value of a single in the Yankees true context, though, because when we add ten singles, we change the context and we affect all of the other values. The larger the change in context, the further we get from an estimate that relates to the actual context. If we added 1000 singles, for example, we would raise the Yankees Batting Average from .263 to .375. This would radically change the context and the estimate of the additional value of each single would have almost no connection to the original context we wanted to evaluate.

Now, the differences for ten singles will probably not be that bad. But if we want more precision, we should add less events. So let's add just one single. This is what I and others in the past have used to evaluate linear weights from multiplicative formulas and called the "+1 method". If we add one single to the Yankees, we find a LW value of .477405417.

Adding one single still changes the context, though. So let's add progressively less singles and see what happens. If we add .1 singles, the LW value is .477344886. If we add .01, it is .477338832. If we add .00001, it is .477338142. As you can see, the values are changing less and less each time. But we still have not completely isolated the value of a single for the 1961 Yankees, because we are still changing the context, albeit by a very small amount.

What we really want to do is add the smallest amount of singles that we possibly can; we want an infinitesimal number of singles. We want to find the change in LW per event added as we add an event that is almost zero. What we want, mathematically speaking, is the limit of the change in LW, divided by X, as X approaches zero, where X is the number of events we add. This concept is called the derivative in calculus.

Since Base Runs has multiple variables, we need multivariable calculus to find this limit. This is done through a technique called "partial differentiation". I am not a calculus teacher, and so I cannot explain all of the details of how to do this with BsR. What I can do is give you a formula that you can apply.

Let A, B, C, and D be the totals calculated for our entity from the A, B, C, and D formulas, and let a, b, c, and d be the coefficient for each event in the A, B, C, and D formulas we are using (zero if the event is not included). Then the Linear Weight of a given event is equal to:

LW = ((B + C)*(A*b + B*a) - (A*B)*(b + c))/((B + C)^2) + d

When you find the coefficient of each event in each factor, you need to look at the full, expanded equation for each factor. Take A for example. A = H + W - HR. But H = S + D + T + HR. So actually, if you expand it, A = S + D + T + HR + W. So the coefficient of each of those events is 1. The HR coefficient is NOT -1, because H + W - HR is just an easy way to write what we actually mean, which is S + D + T + W. This can be tricky, so you need to fully expand each factor to find a, b, c, and d for each event.

Anyway, applying this to the 1961 Yankees, we find that the LW value of a single is (technical note: the linear weights I am referring to here are absolute linear weights, not the kind that are calculated directly from run expectancy) .47738159. This is the value that our values were converging towards.

With this formula in hand, we can calculate the linear weight values for any entity with any BsR version. I have provided a spreadsheet to do this with the official offensive statistics (the older BsR article on this site provides a spreadsheet to use with Tango's expanded BsR formula using Retrosheet data). I have already entered the coefficients for my basic version coupled with composite 1961-2004 data (excluding 1981 and 1994). I do not have all of the event frequency information, but you could fill that in for that dataset or any other if you desire.

Using the basic formula from this page on the 1961-2004 data gives these LW for S, D, T, HR, W, O: .475, .805, 1.135, 1.494, .319, -.095. We will use these values later.

There is another very useful application of this concept. As discussed previously, the B coefficients are the only ones we need to test to find in most cases after we have defined what goes in A, C, and D. Sometimes, though, you know what Linear Weights you would like to generate for the entity as a whole. If you do, you can find the exact B coefficient that you need to produce it for each event through this formula:

b = ((B + C)^2*(L - d) - B^2*a - B*C*a + A*B*c)/(A*C)

Where L is the Linear Weight value you want to get for the event in question. B here is the Exact B that you calculate from actual runs scored, A, C, and D, as you do not yet have B coefficients for each event and therefore cannot compute B. I have provided a spreadsheet which you can use to do this as well.

Unfortunately, all of the events included in any of the factors must be included in the B factor in order to properly reconcile. This can be cumbersome as you often don't want to include outs or some other event in B, but it is a necessity if you want the precise B coefficients.

Known Limitations of Base Runs

This section is not meant to be comprehensive, it is just meant to be a quick discussion of a few of the problems that have been discovered in Base Runs. While it is the author's strong opinion that Base Runs is the most powerful run estimator yet created, because of its applicability across a wide range of contexts, its ease of customization, and its accuracy with regular teams that is comparable to that of any other method, it would be dishonest and unhelpful to pretend that the method is without flaws. These are just a few of the KNOWN issues with Base Runs that may or may not be symptoms of the same underlying problem.

Both of these were discovered by Tango Tiger. The first was detailed in his three part series on run estimators. Base Runs overestimated the run value of events in the approximate range of .500-.800 OBA. The second flaw was that at certain extreme levels of offense, Base Runs failed to follow the obvious baseball truth that the number of runners left on base must be capped at 3.

No advocates of Base Runs claim that it is perfect. However, it does have a logical construction that follows known "laws" of baseball. The area where the accuracy of Base Runs could be enhanced is through a better estimator of the score rate. However, a future solution would almost certainly increase the complexity of the formula. B/(B+C) is a very simple but very effective estimator. However, I look forward to the day when some sabermetrician might correct some of the flaws in Base Runs through a more complex score rate estimate. Whatever the future holds for Base Runs, David Smyth should be remembered for providing the first real new advance in run estimators in over a decade.

Walk Like a Sabermetrician

Tuesday, March 10, 2020

Tripod: Base Runs II

No comments:

Post a Comment

Me, Elsewhere

Analysis Links

Reference Links

Blog Archive

OSU Baseball

End of Season Statistics

Win Shares Walkthrough

NL 1876-1881 Series

Labels

About Me