Comments on Walk Like a Sabermetrician: Pseudo-SIERA Using BsR
Feed updated 2015-01-26T05:43:03.818-05:00 (7 comments)

2010-02-18T09:29:03.333-05:00

Another little modification (one that has almost no effect on RMSE) is using distinct fudge factors to estimate earned runs: one applied to B to make Base Runs equal runs allowed, and another applied to estimated runs allowed to make it equal earned runs. That results in:

B = (2*(eS + 2*eD + 3*eT + 4*eHR) - (eS + eD + eT + eHR) - 4*eHR + .05*W)*.768

Pseudo-SIERA = (A*B/(B+C) + eHR)*.927*9/(E/3)

Posted by p (http://www.blogger.com/profile/18057215403741682609)

2010-02-17T17:40:34.734-05:00

While changes in event frequencies for types of batted balls are quite possible, I don't think it's unreasonable to expect that they'd be stable enough to make an equation of this sort. Formulas like SIERA do not reset every year, and I think it stands to reason that the

The bigger problem, as far as I can tell, is that Colin's data source and BP's data source are probably not the same, and so they have different criteria for what makes a line drive or a flyball, etc.
I don't have the exact figures in front of me right now, but there was a noticeable difference in the distribution of batted ball types between Colin's data and the BP data.

Posted by p (http://www.blogger.com/profile/18057215403741682609)

2010-02-17T15:56:28.667-05:00

I think the problem is that the data Colin provided is from 2003-2008. I'm guessing you are testing with 2009 data, which will have different outcomes for each batted ball type (HR being the biggest effect) compared to 2003-2008, which was much more offensive. Scaling all the outcomes to get the right numbers is, I guess, the best we can do if we don't have any other info.

Posted by Bryan McCulloch (http://www.blogger.com/profile/11500723468497520744)

2010-02-17T09:18:23.606-05:00

Bryan, great catch. I somehow neglected to force home runs equal to the league total (again, this is admittedly cheating for the purposes of testing a formula on even terms against those calibrated on a different dataset). It should be:

eHR = .0677*nG

There are also not enough outs, so introduce a new term E:

E = .8008*G + .6366*nG + K

And change Pseudo-SIERA to the following (I left out the times 9 needed to convert runs to an ERA, although I had it in my spreadsheet all along):

(A*B/(B+C) + eHR)*.914*9/(E/3)

I believe this evens things out, although I certainly could have made another mistake. At this point the specific formula is even more of a jury-rigged atrocity than it was initially; I want to emphasize that this exercise is about the concept, not this specific implementation.
This fix does lower the RMSE ever so slightly, to 1.041.

As you suggest, the best fix would be to use a new hit type outcome database. I don't have the capacity to do that right now, so hopefully someone else will pick up the ball from here (and never fear, I know of at least two people who seem to be doing just that).

Posted by p (http://www.blogger.com/profile/18057215403741682609)

2010-02-16T20:19:14.585-05:00

So I tried to do this with the same data you used. I'm finding that I'm overestimating the runs scored in the league by quite a bit. I think there are too many home runs, but there are too many hits as well. Any ideas for quick fixes? I can always just use a heavy hand and normalize it. New hit type outcome data would probably fix this, but that is harder (aka I don't know how to do that).

Posted by Bryan McCulloch (http://www.blogger.com/profile/11500723468497520744)

2010-02-15T19:10:49.881-05:00

Michael, feel free to post your formulas here or post a link to any article you might write about them.

I figured this was what you were doing when you asked me about BsR. I started working on this Thursday morning but didn't get it finished until Sunday.

Posted by p (http://www.blogger.com/profile/18057215403741682609)

2010-02-15T09:32:16.206-05:00

Patriot,
I talked to you about this a bit on Twitter (MarlinManiac here). I attempted to do this as well, calibrating B to the 2003-2007 ML data, since that was the timeline for the dataset Colin posted. I like the idea of splitting the batted ball data into grounders and nongrounders somewhat like SIERA; I think I'll tinker with that as well.

Posted by Michael (http://www.blogger.com/profile/05993149554069073023)
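
For anyone who wants to try the formulas posted in this thread, here is a minimal Python sketch of the most recent version (the one with the .768 and .927 fudge factors). It is only an illustration under stated assumptions, not the spreadsheet the formulas came from. The function and parameter names are made up for this example. The thread does not define A and C, so they are taken as inputs computed per the original post. Reading G as ground balls, nG as non-ground balls, K as strikeouts, W as walks, and eS/eD/eT/eHR as estimated singles, doubles, triples, and home runs is an interpretation of the comments.

```python
def estimated_home_runs(nG):
    """eHR = .0677*nG, the home run estimate from the 2010-02-17 comment."""
    return 0.0677 * nG


def estimated_outs(G, nG, K):
    """E = .8008*G + .6366*nG + K, the outs term from the 2010-02-17 comment."""
    return 0.8008 * G + 0.6366 * nG + K


def b_factor(eS, eD, eT, eHR, W, b_fudge=0.768):
    """B term from the 2010-02-18 comment; b_fudge is the multiplier applied to B."""
    total_bases = eS + 2 * eD + 3 * eT + 4 * eHR
    hits = eS + eD + eT + eHR
    return (2 * total_bases - hits - 4 * eHR + 0.05 * W) * b_fudge


def pseudo_siera(A, C, eS, eD, eT, G, nG, K, W, b_fudge=0.768, er_fudge=0.927):
    """Pseudo-SIERA = (A*B/(B+C) + eHR) * er_fudge * 9 / (E/3).

    A and C are assumed to be precomputed as in the original post;
    er_fudge is the multiplier that scales estimated runs to earned runs.
    """
    eHR = estimated_home_runs(nG)
    E = estimated_outs(G, nG, K)
    B = b_factor(eS, eD, eT, eHR, W, b_fudge)
    runs = A * B / (B + C) + eHR  # Base Runs estimate of runs allowed
    return runs * er_fudge * 9 / (E / 3)  # E/3 converts outs to innings


# Hypothetical inputs, chosen only to show the call shape (not real pitcher data).
example = pseudo_siera(A=200, C=450, eS=120, eD=35, eT=4,
                       G=280, nG=320, K=150, W=55)
```

Passing er_fudge=0.914 together with whatever multiplier on B the original post used should correspond to the earlier version of the formula given in the 2010-02-17 comment.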