Tuesday, September 28, 2010

Simple BsR Component RAs

I suppose the adjective simple might be considered a misnomer by some readers; I'm referring more to the fact that all of the formulas in this post are going to be based on the same BsR formula. To get it out of the way upfront, I am at my core a lazy and hypocritical sabermetrician. I may write about the importance of sweating the small stuff, but when it comes down to it I pick one version of an equation and stick with it up until someone (occasionally even myself) demonstrates that it is flawed to the point of uselessness.

So I have a bit of an irrational attachment to this particular Base Run equation. It's not a horrible version, but it's not the best that one can come up with given these inputs either. I like it because the coefficients are easy to remember and it doesn't need to consider doubles and triples separately:

A = H + W - HR
B = (2*TB - H - 4*HR + .05*W)*x (where x ~= .78)
C = AB - H
D = HR
BsR = A*B/(B + C) + D
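
As a minimal sketch, the equation above can be written as a short Python function. The function name, argument names, and the sample stat line are hypothetical and only there for illustration; the default x = .78 is the approximate value quoted above.

```python
def base_runs(h, w, hr, tb, ab, x=0.78):
    """Base Runs estimate from hits, walks, home runs, total bases and at bats.

    x is the B multiplier (approximately .78 per the text above).
    """
    a = h + w - hr                                # baserunners
    b = (2 * tb - h - 4 * hr + 0.05 * w) * x      # advancement
    c = ab - h                                    # outs
    d = hr                                        # automatic runs
    return a * b / (b + c) + d


# Hypothetical team-season line, not real data.
print(round(base_runs(h=1400, w=500, hr=150, tb=2200, ab=5500), 1))
```

Note that a line consisting only of home runs and outs has A = 0, so the estimate collapses to D, the home run total.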

This formula will be used as the common starting point for a family of component RAs, each based on different inputs/approaches. I will run through four variants, three of which I have already used at some point on this blog. One commonality will be the use of run average as a unit rather than ERA. I have never cared for the distinction between earned and unearned runs and have always used RA. However, even if one does prefer ERA to RA, it still makes sense to express results in terms of total runs allowed. After all, the score is not kept in terms of earned runs. An equivalent RA is therefore much easier to work into other analytical methods, such as the Pythagorean formula.

As discussed in my previous post, there are a number of different logical paths to take in developing a component RA. The four that I use are the ones that I find most useful. None of them are unique to me--all of these constructions have been developed by others, and are simply being adapted here to specific categories and the BsR equation above. The names I put on them are not intended to obscure that fact. The point of this piece is to discuss the formulas and classify them, not to justify why I find them useful--as I said last time, that's a discussion that would require a separate post, one I don't feel like writing right now. The formulas all assume that one does not have actual opponent AB or PA; if you do, some of the estimation can be dispensed with.

I will be including all four estimators in my end-of-season stats next week, and giving them a separate post will reduce the clutter in the already wordy explanation that goes with the stats.

1. Estimated Run Average (eRA)

eRA is the most straightforward flavor of component RA--a TRD estimator. There is very little manipulation of the BsR formula that goes into it:

A = H + W - HR
B = (2*TB - H - 4*HR + .05*W)*x (where x ~=.78)
C = AB - H = IP*y (where y ~= 2.82)
eRA = (A*B/(B+C) + HR)*9/IP
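
A minimal Python sketch of the eRA calculation, assuming innings are passed as a true decimal (200 1/3 innings as 200.333, not the box-score 200.1). The function name and the sample pitcher line are hypothetical; x and y default to the approximate values given above.

```python
def estimated_ra(ip, h, w, hr, tb, x=0.78, y=2.82):
    """eRA: Base Runs per nine innings, with AB - H estimated as IP*y."""
    a = h + w - hr
    b = (2 * tb - h - 4 * hr + 0.05 * w) * x
    c = ip * y                      # stands in for AB - H
    return (a * b / (b + c) + hr) * 9 / ip


# Hypothetical pitcher line, not real data.
print(round(estimated_ra(ip=200, h=180, w=60, hr=20, tb=290), 2))
```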

2. DIPS-style RA (dRA)

While the concept behind a DIPS RA has been well-known in sabermetrics thanks to Voros McCracken's work, it's still the most complex to calculate because one needs to reconstruct the pitching line, as innings pitched can no longer be treated as a constant. This is a TDD estimator.

Without using actual pitcher PA data, the first step is to estimate PA as IP*y + H + W (this estimate can be improved by spinning strikeouts off from innings, but I'm keeping it simple for this application). Then calculate the percentage of PA that result in walks, strikeouts, and home runs (I call these %W, %K, and %HR respectively). The percentage of balls in play (BIP%) is then just 1 - %W - %K - %HR. Multiplying BIP% by the league %H on balls in play ((H-HR)/BIP) yields a new DIPS estimate of non-HR hits per PA, which I'll call e%H.

Now everything is expressed as a rate with a common denominator of PA, and it's easy to plug in:

A = e%H + %W
B = (2*(z*e%H + 4*%HR) - e%H - 5*%HR + .05*%W)*x
C = 1 - e%H - %W - %HR
dRA = (A*B/(B + C) + %HR)/C*a

I've thrown in a couple more constants here; z is the league's average number of total bases per non-HR hit ((TB - 4*HR)/(H - HR)), generally around 1.28. a is the average number of AB-H per league game, generally around 25.2.
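
The steps for this section can be sketched in Python as follows. This is only an illustration under stated assumptions: the function and argument names are hypothetical, the league hit rate on balls in play must be supplied by the caller (the .30 in the example is a placeholder, not a measured value), and the defaults for x, y, z, and a are the approximate figures quoted in this post.

```python
def dips_ra(ip, h, w, k, hr, lg_h_per_bip, x=0.78, y=2.82, z=1.28, a=25.2):
    """DIPS-style RA: rebuild the pitching line per PA with a league hit rate."""
    pa = ip * y + h + w                   # estimated plate appearances
    pct_w, pct_k, pct_hr = w / pa, k / pa, hr / pa
    pct_bip = 1 - pct_w - pct_k - pct_hr  # balls in play per PA
    e_h = pct_bip * lg_h_per_bip          # estimated non-HR hits per PA

    bsr_a = e_h + pct_w
    bsr_b = (2 * (z * e_h + 4 * pct_hr) - e_h - 5 * pct_hr + 0.05 * pct_w) * x
    bsr_c = 1 - e_h - pct_w - pct_hr      # outs per PA
    # Runs per out, scaled by the league average of AB - H per game.
    return (bsr_a * bsr_b / (bsr_b + bsr_c) + pct_hr) / bsr_c * a


# Hypothetical pitcher line and placeholder league rate, not real data.
print(round(dips_ra(ip=200, h=180, w=60, k=180, hr=20, lg_h_per_bip=0.30), 2))
```

In this sketch the pitcher's actual hits enter only through the PA estimate; the hits that feed A, B, and C come entirely from the league rate.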

dRA is rate-based because it is easier to reconstruct the pitcher's line by just altering %H rather than constructing new estimates for hits and innings, which also change. There are equivalent approaches that would do just that.

3. Batted Ball class RA (cRA)

The obvious abbreviation here would be bRA--Dick Cramer used that once, but I'm going to pass. This is a TBD metric that considers batted ball types at their estimated values, without taking the SIERA route of giving line drives special treatment. tRA is the most famous metric that uses batted ball data in this way, but it uses linear weights and thus is classified as TBS. At least one other sabermetrician has already worked up a TBD metric of this type.

Admittedly, I am a little bit out of my element when working with the batted ball data. A metric like tRA is also a lot simpler than this one, because of the nature of using Base Runs. We cannot just take the linear weight for a groundball and calculate the needed BsR B weight to reproduce it for a given dataset, because we must be able to estimate baserunners, outs, and home runs from the batted ball types, rather than just needing the aggregate linear weight of the batted ball type.

This estimation, in my case at least, involves bringing in multiple sources of data--one which gives outcome (S, HR, etc.) frequencies for each batted ball type and another for the actual batted ball data for each pitcher. There are differences between data sources (STATS v. BIS v. Retrosheet) as well as other concerns about the uniformity of the data, which make this a significantly trickier task than estimating RA and mean that formulas of this type will almost certainly need more upkeep over time.

Using the data from Colin Wyers here and the 2009 league totals of GB, FB, PU, and LD from Baseball Prospectus (here we have one of the data differences that I alluded to--Retrosheet v. BP's source) and reconciling to make sure that the estimates of each hit type equal the actual 2009 totals, we get these equations:

cS = .057*FB + .217*GB + .516*LD + .017*PU
cD = .081*FB + .018*GB + .175*LD + .004*PU
cT = .0126*FB + .001*GB + .0155*LD
cHR = .115*FB + .024*LD

We also need an estimate of outs, which I'll call cC because it plugs directly into the BsR equation:

cC = .6864FB + .7468GB + .2736LD + .9094PU + K

A = cS + cD + cT + W
B = (cS + 3*cD + 5*cT + 3*cHR + .05*W)*.742
cRA = (A*B/(B + cC) + cHR)*9/(cC*.3547)
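
A minimal Python sketch of the cRA calculation, using the coefficients exactly as listed above (which were fit to 2009 data and would need refreshing for other seasons). The function name and the sample batted ball counts are hypothetical.

```python
def batted_ball_ra(fb, gb, ld, pu, w, k):
    """cRA: Base Runs on hit and out totals estimated from batted ball types."""
    c_s = 0.057 * fb + 0.217 * gb + 0.516 * ld + 0.017 * pu
    c_d = 0.081 * fb + 0.018 * gb + 0.175 * ld + 0.004 * pu
    c_t = 0.0126 * fb + 0.001 * gb + 0.0155 * ld
    c_hr = 0.115 * fb + 0.024 * ld
    c_c = 0.6864 * fb + 0.7468 * gb + 0.2736 * ld + 0.9094 * pu + k  # outs

    a = c_s + c_d + c_t + w
    b = (c_s + 3 * c_d + 5 * c_t + 3 * c_hr + 0.05 * w) * 0.742
    # c_c * .3547 converts estimated outs into estimated innings.
    return (a * b / (b + c_c) + c_hr) * 9 / (c_c * 0.3547)


# Hypothetical pitcher batted ball line, not real data.
print(round(batted_ball_ra(fb=200, gb=280, ld=110, pu=30, w=60, k=180), 2))
```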

4. SIERA-style RA (sRA)

SIERA is the new BP component ERA developed by Matt Swartz and Eric Seidman. I applied BsR in conjunction with what could loosely be described as a SIERA-style construction in this post, and so I won't go through each step again. Here I am presenting a different equation than in that post because this one is designed to estimate RA rather than ERA, but the logic is the same otherwise. Like cRA, this is a TBD metric--the difference is that batted balls are grouped in just two bins (grounders and non-grounders) rather than four:

nG = FB + LD + PU
G = GB
eS = .2167(G) + .2092(nG)
eD = .0184(G) + .1022(nG)
eT = .001(G) + .0119(nG)
eHR = .0701(nG)
A = eS + eD + eT + W
B = (eS + 3*eD + 5*eT + 3*eHR + .05*W)*.742
C = .76*G + .565*nG + K
sRA = (A*B/(B + C) + eHR)*9/(C*.355)
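
The same calculation as a minimal Python sketch, with the coefficients copied from the equations above. The function name and the sample counts are hypothetical and for illustration only.

```python
def siera_style_ra(fb, gb, ld, pu, w, k):
    """sRA: Base Runs on hits and outs estimated from grounders v. non-grounders."""
    ng = fb + ld + pu          # non-grounders
    g = gb                     # grounders
    e_s = 0.2167 * g + 0.2092 * ng
    e_d = 0.0184 * g + 0.1022 * ng
    e_t = 0.001 * g + 0.0119 * ng
    e_hr = 0.0701 * ng

    a = e_s + e_d + e_t + w
    b = (e_s + 3 * e_d + 5 * e_t + 3 * e_hr + 0.05 * w) * 0.742
    c = 0.76 * g + 0.565 * ng + k                 # estimated outs
    return (a * b / (b + c) + e_hr) * 9 / (c * 0.355)


# Hypothetical pitcher batted ball line, not real data.
print(round(siera_style_ra(fb=200, gb=280, ld=110, pu=30, w=60, k=180), 2))
```

Because only two bins are used, trading a line drive for a pop-up leaves the estimate unchanged here, whereas trading a non-grounder for a grounder lowers it.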
