Sunday, February 14, 2010

Pseudo-SIERA Using BsR

There has been a lot of chatter in the sabermetric community about SIERA, the new DIPS-style ERA estimator using batted ball data developed by Eric Seidman and Matt Swartz for Baseball Prospectus. Before even commenting on the metric itself, I think it is worthwhile to applaud BP for their recent hires (including Swartz and Seidman, but also Tommy Bennett and Colin Wyers) that have reinvigorated the sabermetric aspect of their operation--and for their openness in sharing the development of SIERA in a five-part series.

SIERA is innovative in that it does not consider line drives (except to the extent that they also represent a plate appearance), citing the low correlation in year-to-year line drive frequencies; it considers groundball rate with a denominator of plate appearances rather than balls in play; it treats flyballs and pop-ups equally and as offsets to grounders in a (GB - FB - PU)/PA term; and it is based on a regression equation with quadratic and interaction terms.

It is this last property of the metric which makes someone who approaches sabermetrics from my particular viewpoint and biases blanch a bit. My personal reaction to any sort of "ugly" regression equation is to wonder if there is a way to accomplish the same objective with similar accuracy while using a more intuitive (to me) model. Naturally, whenever run estimation is involved, that leads me to Base Runs, which is the most intuitive simple model of team run scoring (IMO--of course, this blog is always just my opinion and nothing more, but I'm being extra careful today).

So I quickly threw together a Base Run estimator, using only batted ball types (GB, FB, LD, and PU), walks, and strikeouts. I further divided batted balls into just two categories--groundballs (G) and non-grounders (nG = FB + PU + LD). This is not exactly the same as what SIERA does by using (GB - FB - PU)/PA, but it does embrace the not-so-fine breakdown of batted ball types that SIERA introduced.

I then estimated singles, doubles, triples, homers, and outs from G, nG, and K. To do this, I used the data published by Colin Wyers here on event rates by batted ball types, and figured a weighted average for non-grounders by using the GB, FB, PU, and LD data published by BP. This posed a bit of a problem, as Colin used Retrosheet definitions to derive his figures while BP uses a different data source; such is the nature of working with batted ball data without a DIY approach.

From there, I ensured that estimated singles equaled actual singles for the 2009 major leagues (and doubles, etc.) and simply plugged everything into BsR:

eS = .2167(G) + .2092(nG)
eD = .0184(G) + .1022(nG)
eT = .001(G) + .0119(nG)
eHR = .0701(nG)
A = eS + eD + eT + W
B = (2*(eS + 2*eD+ 3*eT + 4*eHR) - (eS + eD + eT + eHR) - 4*eHR + .05*W)*.78
C = .7636(G) + .607(nG) + K
Pseudo-SIERA = (A*B/(B+C) + eHR)*.904/(C/3)

NOTE: As commenter Bryan points out, this formula is riddled with errors. I'll leave it here so as not to hide my stupidity, but please see comment #4 for a corrected set of equations.
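
For anyone who wants to experiment, here is a minimal Python sketch of the corrected equations from comment #4. The function name `pseudo_siera` and the keyword parameters `b_mult` and `era_mult` are labels made up for illustration, not anything taken from the original spreadsheet; the inputs are assumed to be a pitcher's season totals of G, nG, W, and K.

```python
def pseudo_siera(G, nG, W, K, b_mult=0.78, era_mult=0.914):
    """Pseudo-SIERA per the corrected equations in comment #4.

    G, nG, W, K are totals of groundballs, non-grounders (FB + PU + LD),
    walks, and strikeouts. b_mult and era_mult are hypothetical names for
    the two multipliers that appear in the formulas.
    """
    # Estimated hits by type from batted ball totals
    eS = 0.2167 * G + 0.2092 * nG
    eD = 0.0184 * G + 0.1022 * nG
    eT = 0.001 * G + 0.0119 * nG
    eHR = 0.0677 * nG  # corrected home run coefficient

    # Base Runs components
    A = eS + eD + eT + W
    total_bases = eS + 2 * eD + 3 * eT + 4 * eHR
    hits = eS + eD + eT + eHR
    B = (2 * total_bases - hits - 4 * eHR + 0.05 * W) * b_mult
    C = 0.7636 * G + 0.607 * nG + K

    # E is the separate outs term introduced in the correction;
    # E / 3 stands in for innings pitched
    E = 0.8008 * G + 0.6366 * nG + K

    estimated_runs = (A * B / (B + C) + eHR) * era_mult
    return estimated_runs * 9 / (E / 3)


# Made-up pitcher line, purely for illustration
print(round(pseudo_siera(G=300, nG=350, W=60, K=150), 2))
```

Passing `b_mult=0.768` and `era_mult=0.927` gives the two-fudge-factor variant described in comment #7.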

It is imperative to point out that what I did was clearly cheating from the standpoint of testing a formula--I calibrated it against the very dataset against which I am going to test it, rendering the test essentially useless. I did this simply for ease--my point in this entire exercise is not to supplant SIERA but just to raise the question of whether a different model can incorporate Seidman and Swartz's findings.

There is also a great deal of phony precision (i.e. ten-thousandths place decimals) on display, and a great deal of simplification that could be done to the terms (everything could be written in just four variables--G, nG, W, and K, but unless one is actually intending to use the formula on a regular basis doing so is a waste of time). Also, a better BsR B factor could have been used--I stuck with one I've used forever despite being less than optimal.

I then applied Pseudo-SIERA to all pitchers with >= 200 PA in 2009. The RMSE in estimating ERA for the real SIERA was 1.00; for Pseudo-SIERA, 1.06. Of course, as I already made clear, Pseudo-SIERA had the advantage of being calibrated specifically on this dataset.
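
The RMSE comparison above can be sketched in a few lines of Python. This assumes a simple unweighted RMSE across pitchers (whether any weighting by playing time was used is not specified here), and the function name `rmse` is just an illustrative choice.

```python
import math


def rmse(estimates, actuals):
    """Unweighted root mean squared error between two equal-length sequences."""
    if len(estimates) != len(actuals) or len(estimates) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(e - a) ** 2 for e, a in zip(estimates, actuals)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up estimator values and ERAs, purely for illustration
print(rmse([3.0, 5.0], [4.0, 4.0]))
```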

The more telling test of a SIERA-type metric is how it does at predicting future ERA, something that I have obviously not tested here. There's really no need, at least with this implementation of the pseudo formula--I have no expectation that it would outperform SIERA. My goal here was just to incorporate a couple of ideas from Seidman and Swartz's work into a BsR model, and demonstrate that such a model has potential to be used in conjunction with those ideas. Nothing more.

On a final, unrelated note, I posted the first guest scoresheet contribution to Weekly Scoresheet yesterday. I have two more ready to go, and hopefully there will be more to come. I realized shortly after sending out my initial request for scoresheets that I should have waited until the season was underway and people would be more likely to have scoring on the brain (and scoresheets sitting around where they could be easily procured). Oh well.

7 comments:

  1. Patriot,

    I talked to you about this a bit on Twitter (MarlinManiac here). I attempted to do this as well, calibrating B to the 2003-2007 ML data, since that was the timeline for the dataset Colin posted. I like the idea of splitting the batted ball data into grounders and non-grounders somewhat like SIERA; I think I'll tinker with that as well.

  2. Michael, feel free to post your formulas here or post a link to any article you might write about them.

    I figured this was what you were doing when you asked me about BsR--I started working on this Thurs morning but didn't get it finished until Sunday.

  3. So I tried to do this with the same data you used. I'm finding that I'm overestimating the runs scored in the league by quite a bit. I think there are too many home runs but there are also too many hits as well. Ideas of quick fixes? I can always just use a heavy hand and normalize it. New hit type outcome data would probably fix this but that is harder (aka I don't know how to do that).

  4. Bryan, great catch. I somehow neglected to force home runs equal to the league total (again, this is admittedly cheating for the purposes of testing a formula on even terms against those calibrated on a different dataset). It should be:

    eHR = .0677*nG

    There are also not enough outs, so introduce a new term E:

    E = .8008*G + .6366*nG + K

    And change Pseudo-SIERA to the following (I left out the times 9 needed to convert runs to an ERA, although I had it in my spreadsheet all along):

    Pseudo-SIERA = (A*B/(B+C) + eHR)*.914*9/(E/3)

    I believe this evens things out, although I certainly could have made another mistake. At this point the specific formula is even more of a jury-rigged atrocity than it was initially; I want to emphasize that this exercise is about the concept, not this specific implementation. This fix does lower the RMSE ever-so-slightly to 1.041.

    As you suggest, the best fix would be to use a new hit type outcome database. I don't have the capacity to do that right now, so hopefully someone else will pick up the ball from here (and never fear, I know of at least two people who seem to be doing just that).

  5. I think the problem comes from the fact that the data Colin provided is from 2003-2008. I'm guessing you are testing using 2009 data, which will have different outcomes for each batted ball type (HR being the biggest effect) compared to 2003-2008, which was a much more offensive period. Scaling all the outcomes to get the right numbers is, I guess, the best we can do if we don't have any other info.

  6. While changes in event frequencies for types of batted balls are quite possible, I don't think it's unreasonable to expect that they'd be stable enough to make an equation of this sort. Formulas like SIERA do not reset every year, and I think it stands to reason that the

    The bigger problem, as far as I can tell, is that Colin's data source and BP's data source are probably not the same, and so they have different criteria for what makes a line drive or a flyball, etc. I don't have the exact figures in front of me right now, but there was a noticeable difference in the distribution of batted ball types between Colin's data and the BP data.

  7. Another little modification (one that has almost no effect on RMSE) is using distinct fudge factors to estimate earned runs--one applied to B to make Base Runs equal runs allowed, and another to estimated runs allowed to make it equal earned runs. That results in:

    B = (2*(eS + 2*eD+ 3*eT + 4*eHR) - (eS + eD + eT + eHR) - 4*eHR + .05*W)*.768

    Pseudo-SIERA = (A*B/(B+C)+ eHR)*.927*9/(E/3)

