Tuesday, February 25, 2020

Tripod: Runs Created

See the first paragraph of this post for an explanation of this series.

Bill James' Runs Created remains the most used run estimator, although there is no good reason for that being the case. It is odd that sabermetricians, generally a group inclined to fight preconceived notions and to not worship tradition for the heck of it, continue to use a method like RC.

Let me be clear: Bill James is my favorite author, and he is the most influential and important(and one of the very best) sabermetrician of all time. When he developed RC, it was just about as good as anything that anybody else had developed to estimate team runs scored, and the thought process that went into developing it was great. But the field has moved forward and left RC behind.

I will look at the theory of RC first and get back to the deficiencies of the method and the alternatives to it later. The basic structure of Runs Created is that runs are scored by first getting runners on base and then driving them in, all occurring within an opportunity space. Since getting runners on base and advancing them is an interactive process(if there is no one on base to drive in, all the advancement in the world will get you nowhere, and getting runners on base but not driving them in will not score many runs either), the on base component and the advancement component are multiplied and divided by the opportunity component. A represents on base, B represents advancement, and C represents opportunity. The construct of RC is A*B/C.

No matter how many elements are introduced into the formula, it maintains the A*B/C structure. The first version of the formula, the basic version, is very straightforward. A = H+W, B = TB, and C = AB+W, or RC = (H+W)*TB/(AB+W). This simple formula is fairly accurate in predicting runs, with an RMSE in the neighborhood of 25(when I refer to accuracy right now I'm talking solely about predicting runs for normal major league teams).

The basic form of RC has several useful properties. The math simplifies so that it can be written as OBA*SLG*AB, which is also OBA*TB. Or if you define TB/PA as Total Base Average, you can write it as OBA*TBA*(AB+W). Also, RC/(AB-H), runs/out, is OBA*SLG/(1-BA).
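The identities above are easy to confirm numerically. This is a minimal sketch; the team totals are hypothetical, chosen only for illustration, and the function and variable names are my own.

```python
def basic_rc(h, w, tb, ab):
    """Basic Runs Created: A*B/C with A = H+W, B = TB, C = AB+W."""
    return (h + w) * tb / (ab + w)

# Hypothetical team totals, for illustration only.
h, w, tb, ab = 1400, 500, 2200, 5500

oba = (h + w) / (ab + w)   # on base average as used in basic RC
slg = tb / ab              # slugging average
ba = h / ab                # batting average
tba = tb / (ab + w)        # Total Base Average, TB/PA

rc = basic_rc(h, w, tb, ab)
rc_via_oba_slg = oba * slg * ab
rc_via_oba_tb = oba * tb
rc_via_tba = oba * tba * (ab + w)

runs_per_out = rc / (ab - h)
runs_per_out_via_rates = oba * slg / (1 - ba)

print(rc, rc_via_oba_slg, rc_via_oba_tb, rc_via_tba)
print(runs_per_out, runs_per_out_via_rates)
```

All four RC expressions agree, as do the two runs/out expressions, up to floating point rounding.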

The basic rate rewrite of RC is also useful: (A/C)*(B/C)*C, which is easily seen to equal A*B/C. If you call A/C modified OBA(MOBA) and B/C modified TBA(MTBA), you can write all versions of RC as MOBA*MTBA*C, which, as we will see, will come in handy later.

James' next incarnation was to include SB and CS in the formula as they are fairly basic offensive stats. A became H+W-CS, B became TB+.7*SB, and C became AB+W+CS.

A couple years later (in the 1983 Abstract to be precise), James introduced an "advanced" version of the formula that included just about all of the official offensive statistics. This method was constructed using the same reasoning as the stolen base version. Baserunners lost are subtracted from the A factor, events like sacrifice flies that advance runners are credited in the B factor, and all plate appearances and extra outs consumed(like CS and DP) are counted as opportunity in the C factor.

A = H+W+HB-CS
B = TB+.65(SB+SH+SF)
C = AB+W+HB+SH+SF+CS+DP

In his 1984 book, though, James rolled out a new SB and technical version, citing their higher accuracy and structural problems in his previous formulas. The key structural problem was including outs like CS and DP in the C factor. This makes a CS too costly. As we will see later in calculating the linear weights, the value of a CS in the original SB version is -.475 runs(using the 1990 NL for the event frequencies). The revision cuts this to -.363 runs. That revision is:

A = H+W-CS
B = TB+.55*SB
C = AB+W

In addition to being more accurate and more logical, the new version is also simpler. The revision to the technical formula would stand as the state of RC for over ten years and was figured thusly:

A = H+W+HB-CS-DP
B = TB+.26(W+HB-IW)+.52(SB+SH+SF)
C = AB+W+HB+SH+SF

Additionally, walks are introduced into the B factor; obviously walks have advancement value, but including them in the basic version would have ruined the elegance of OBA*TB. With the added complexity of the new formula, James apparently saw no reason not to include walks in B.

The technical formula above is sometimes called TECH-1 because of a corresponding series of 14 technical RC formulas designed to give estimates for the majors since 1900.

Around 1997, James made additional changes to the formula, including strikeouts in the formula for the first time, introducing adjustments for performance in two "clutch" hitting situations, reconciling individual RC figures to equal team runs scored, and figuring individual RC within a "theoretical team" context. James also introduced 23 other formulas to cover all of major league history. The modern formula is also known as HDG-1(for Historical Data Group). The changes to the regular formula itself were quite minor and I will put them down without comment:

A = H+W+HB-CS-DP
B = TB+.24(W+HB-IW)+.5(SH+SF)+.62SB-.03K
C = AB+W+HB+SH+SF

Whether or not the clutch adjustments are appropriate is an ability v. value question. Value-wise, there is nothing wrong with taking clutch performance into account. James gives credit for hitting homers with men on base at a higher rate than overall, and for a batting average with runners in scoring position that is higher than overall batting average. The nature of these adjustments seems quite arbitrary to this observer--one run for each excess home run or hit. With all of the precision in the rest of the RC formula, with its hundredth place coefficients, you would think that there would be some more rigorous calculations to make the situational adjustments. These are added to the basic RC figure--except the basic RC no longer comes from A*B/C; it comes from (A+2.4C)(B+3C)/(9C)-.9C(more on this in a moment). That figure is rounded to a whole number, the situational adjustments are added, then the figures for each hitter on the team are summed. This sum is divided into the team runs scored total to get the reconciliation factor, which is then multiplied by each individual's RC, which is once again rounded to a whole number to get the final Runs Created figure.

Quite a mouthful. Team reconciliation is another area that falls into the broad ability v. value decision. It is certainly appropriate in some cases and inappropriate in others. For Bill James' purpose of using the RC figures in a larger value method(Win Shares), in this observer's eyes they are perfectly appropriate. Whether they work or not is a question I'll touch on after explaining the theoretical team method.

The idea behind the theoretical team is to correct one of the most basic flaws of Runs Created, one that Bill James had noticed at least as early as 1985. In the context of introducing Paul Johnson's ERP, a linear method(although curiously it is an open question whether James noticed this at the time, as he railed against Pete Palmer's Batting Runs in the Historical Abstract), James wrote: "I've known for a little over a year that the runs created formula had a problem with players who combined high on-base percentages and high slugging percentages--he is certainly correct about that--and at the time that I heard from him I was toying with options to correct these problems. The reasons that this happens is that the players' individual totals do not occur in an individual context...the increase in runs created that results from the extension of the one[on base or advancement ability] acting upon the extension of the other is not real; it is a flaw in the run created method, resulting from the player's offense being placed in an individual context."

The basic point is that RC is a method designed to estimate team runs scored. By putting a player's statistics in a method designed to estimate team runs scored, you are introducing problems. Each member of the team's offensive production interacts with the other eight players. But Jim Edmonds' offense does not interact with itself; it interacts with that of the entire team. A good offensive player like Edmonds, who has superior OBA and TBA, benefits by having them multiplied. But in actuality, his production should be considered within the context of the whole team. The team OBA with Edmonds added is much smaller than Edmonds' personal OBA, and the same is true for TBA.

So the solution(one which I am quite fond of and, following the lead of James, David Tate, Keith Woolner, and David Smyth among others have applied to Base Runs) that James uses is to add the player to a team of fairly average OBA and TBA, and to calculate the difference between the number of runs scored with the player and the runs scored without the player, and call this the player's Runs Created. This introduces the possibility of negative RC figures. This is one of those things that is difficult to explain but has some theoretical basis. Mathematically, negative RC must be possible in any linear run estimation method. It is beyond the scope of this review of Runs Created to get into this issue in depth.

The theoretical team is made up of eight players plus the player whose RC we are calculating. The A component of the team is (A+2.4C). This is the player's A, plus 2.4/8=.3 A/PA for the other players. Remember, A/PA is MOBA(and B/PA is MTBA). So the eight other players have a MOBA of .300. The B component of the team is (B+3C), so 3/8=.375 B/PA or a .375 MTBA for the remainder of the team. Each of the eight players has C number of plate appearances(or the player in question's actual PA), so the team has 9C plate appearances, and their RC estimate is (A+2.4C)(B+3C)/(9C). The team without the player has an A of 2.4C, a B of 3C, and a C of 8C, giving 2.4C*3C/8C=.9C runs created. Without adding the ninth player, the team will score .9C runs. So this is subtracted, and the difference is Runs Created.

James does not do this, but it is easy to change the subtracted value to give runs above average(just use nine players with MOBA .300 and MTBA .375, or adjust these values to the league or some other entity's norms, and then run them through the procedure above). Generally, we can write TT RC as:

(A+8*LgMOBA*C)(B+8*LgMTBA*C)/(9C)-LgMOBA*LgMTBA*8C(or 9C for average)

This step of the RC process is correct in my opinion, or at least justifiable. But one question that I do have for Mr. James is why always .300/.375? Why not have this value vary by the actual league averages, or some other criteria? It is true that slight changes in the range of major league MOBA and MTBA values will not have a large effect on the RC estimates, but if everything is going to be so precise, why not put precision in the TT step? If we are going to try to estimate how many runs Jim Edmonds created for the 2004 Cardinals, why not start the process by measuring how Jim Edmonds would affect a team with the exact offensive capabilities of the 2004 Cardinals? Then when you note the amount of precision(at least computationally if not logically) in Win Shares, you wonder even more. Sure, it is a small thing, but there are a lot of small things that are carefully corrected for in the Win Share method.

Just to illustrate the slight differences, let's take a player with a MOBA of .400 and a MTBA of .500 in 500 PA and calculate his TT RC in two situations. One is on the team James uses--.300/.375. His RC will be (.400*500+.300*500*8)(.500*500+.375*500*8)/(9*500)-.9*500, or 94.44. On a .350/.425 team(a large difference of 32% more runs/plate appearance), his RC figured analogously will be 98.33. That is a difference of less than four runs for a huge difference in teams. So while ignoring this probably does not cause any noticeable problems for either RC or WS estimates, it does seem a little inconsistent.
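The two figures can be reproduced with a short function. This is only a sketch of the theoretical team calculation as described here; the function name and the keyword defaults are my own choices, not anything James published.

```python
def tt_rc(a, b, c, ref_moba=0.300, ref_mtba=0.375):
    """Theoretical team RC: the player plus eight reference hitters,
    each given the player's own number of plate appearances (C)."""
    team_a = a + 8 * ref_moba * c      # A of the nine-man team
    team_b = b + 8 * ref_mtba * c      # B of the nine-man team
    with_player = team_a * team_b / (9 * c)
    without_player = ref_moba * ref_mtba * 8 * c   # eight-man team RC
    return with_player - without_player

pa = 500
a = 0.400 * pa   # player's A from a .400 MOBA
b = 0.500 * pa   # player's B from a .500 MTBA

james_team = tt_rc(a, b, pa)                  # .300/.375 reference team
strong_team = tt_rc(a, b, pa, 0.350, 0.425)   # .350/.425 reference team

print(round(james_team, 2))    # 94.44
print(round(strong_team, 2))   # 98.33
```

Passing league MOBA and MTBA in place of the defaults gives the league-specific version asked about above.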

But while the TT procedure is mathematically correct and sabermetrically justifiable, it does not address the larger problem of RC construction. Neither does Bill's latest tweak to the formula, published in the 2005 Bill James Handbook. He cites declining accuracy of the original formula in the current high-home run era and proposes this new B factor:

B = 1.125S+1.69D+3.02T+3.73HR+.29(W-IW+HB)+.492(SB+SH+SF)-.04K

None of these changes corrects the most basic, most distorting flaw of Runs Created. That is its treatment of home runs. David Smyth developed Base Runs in the 1990s to correct this flaw. He actually tried to work with the RC form to develop BsR, but couldn't get it to work. So instead he came up with a different construct(A*B/(B+C)+D) that was still inspired by the idea of Runs Created. Once again, James' ideas have been an important building block for run estimation thinking. RC was fine in its time. But its accuracy has been surpassed and its structure has been improved upon.

A home run always produces at least one run, no matter what. In RC, a team with 1 HR and 100 outs will be projected to score 1*4/101 runs, a far cry from the one run that we know will score. And in an offensive context where no outs are made, all runners will eventually score, and each event, be it a walk, a single, a home run--any on base event at all--will be worth precisely one run. In a 1.000 OBA context, RC puts a HR at 1*4/1 = 4 runs. This flaw is painfully obvious at that kind of extreme point, but the distorting effects begin long before that. The end result is that RC is too optimistic for high OBA, high SLG teams and too pessimistic for low OBA, low SLG teams. The home run flaw is one of the reasons why James proposed the new B factor in 2004--but that may cause more problems in other areas as we will see.
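Both extreme cases can be checked directly with the basic formula. A minimal sketch, with names of my own choosing:

```python
def basic_rc(h, w, tb, ab):
    """Basic Runs Created: (H + W) * TB / (AB + W)."""
    return (h + w) * tb / (ab + w)

# One home run and 100 outs: H = 1, TB = 4, AB = 101.
# At least one run must score, but RC projects only 4/101 of a run.
low_oba_case = basic_rc(h=1, w=0, tb=4, ab=101)

# A lone home run with no outs at all (1.000 OBA): H = 1, TB = 4, AB = 1.
# The event should be worth one run, but RC credits four.
perfect_oba_case = basic_rc(h=1, w=0, tb=4, ab=1)

print(low_oba_case, perfect_oba_case)
```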

One way to evaluate Runs Created formulas is to see what kind of inherent linear weights they use. We know, based on empirical study, very good values for the linear weight of each offensive event. Using calculus, we can find precisely, for the statistics of any entity, the linear weights that any RC formula is using in that case. I'll skip the calculus, but for those who are interested, it involves partial derivatives.

LW = (C(Ab + Ba) - ABc)/C^2

Where A, B, and C are the total calculated A, B, and C factors for the entity in question, and a, b, and c are the coefficients for the event in question(single, walk, out, etc.) in the RC formula being used. This can be written as:

LW = (B/C)*a + (A/C)*b - (A/C)*(B/C)*c
= MTBA(a) + MOBA(b) - MOBA*MTBA*c

Take a team with a .350 MOBA and a .425 MTBA. For the basic RC formula, the coefficients for a single in the formula are a = 1, b = 1, c = 1, so the linear weight of a single is .425*1 + .350*1 - .425*.350*1 = .626 runs. A batting out, which has a = 0, b = 0, c = 1, is worth -.425*.350*1 = -.149 runs.
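The same arithmetic can be wrapped in a small function and applied to every event in the basic formula. This is a sketch; the coefficient table just restates A = H+W, B = TB, C = AB+W event by event, and the names are my own.

```python
def rc_linear_weight(moba, mtba, a, b, c):
    """Linear weight implied by an A*B/C formula:
    LW = MTBA*a + MOBA*b - MOBA*MTBA*c."""
    return mtba * a + moba * b - moba * mtba * c

# (a, b, c) coefficients of each event in basic RC.
basic_coefficients = {
    "single": (1, 1, 1),
    "double": (1, 2, 1),
    "triple": (1, 3, 1),
    "home run": (1, 4, 1),
    "walk": (1, 0, 1),
    "batting out": (0, 0, 1),
}

moba, mtba = 0.350, 0.425
weights = {
    event: rc_linear_weight(moba, mtba, *coefs)
    for event, coefs in basic_coefficients.items()
}

print(round(weights["single"], 3))        # 0.626
print(round(weights["batting out"], 3))   # -0.149
```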

Let's use this approach with a fairly typical league(the 1990 NL) to generate the Linear Weight values given by three different RC constructs: basic, TECH-1, and the 2004 update.

Event: Basic, TECH-1, 2004
Single: .558, .564, .598
Double: .879, .855, .763
Triple: 1.199, 1.146, 1.150
Home Run: 1.520, 1.437, 1.356
Walk/Hit Batter: .238, .348, .355
Intentional Walk: N/A, .273, .271
Steal: N/A, .151, .143
Caught Stealing: N/A, -.384, -.382
Sacrifice Hit: N/A, .039, .032
Sacrifice Fly: N/A, .039, .032
Double Play: N/A, -.384, -.382
Batting Out(AB-H): -.112, -.112, N/A
In Play Out(AB-H-K): N/A, N/A, -.111
Strikeout: N/A, N/A, -.123

Comparing these values to empirical LW formulas and other good linear formulas like ERP, we see, starting with the Basic version, that all of the hits are overemphasized while walks are severely underemphasized. The TECH-1 version brings the values of all hit types in line(EXCEPT singles), and fixes the walk problems. The values generated by TECH-1, with the glaring exception of the single, really aren't that bad. However, the 2004 version grossly understates the impact of extra base hits. I don't doubt James' claim that it gives a lower RMSE for normal major league teams than the previous versions, but theoretically, it is a step backwards in my opinion.

You can use these linear values as a traditional linear weight equation if you want, but they are at odds in many cases with empirical weights and those generated through a similar process by BsR. One good thing is that Theoretical Team RC is equal to 1/9 times traditional RC plus 8/9 of linear RC. Traditional RC is the classic A*B/C construct, whereas the linear RC must be appropriate for the reference team used in the TT formula.
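The 1/9 and 8/9 decomposition follows from expanding the TT formula, and it can be confirmed numerically. A minimal sketch with function names of my own, using the .400/.500 player in 500 PA on the .300/.375 reference team:

```python
def traditional_rc(a, b, c):
    """Classic A*B/C."""
    return a * b / c

def linear_rc(a, b, c, ref_moba, ref_mtba):
    """Player's totals valued at the linear weights implied by the
    reference team: MTBA*A + MOBA*B - MOBA*MTBA*C."""
    return ref_mtba * a + ref_moba * b - ref_moba * ref_mtba * c

def tt_rc(a, b, c, ref_moba, ref_mtba):
    """Theoretical team RC with eight reference hitters."""
    with_player = (a + 8 * ref_moba * c) * (b + 8 * ref_mtba * c) / (9 * c)
    return with_player - 8 * ref_moba * ref_mtba * c

a, b, c = 200.0, 250.0, 500.0      # .400 MOBA, .500 MTBA, 500 PA
ref_moba, ref_mtba = 0.300, 0.375

direct = tt_rc(a, b, c, ref_moba, ref_mtba)
blended = (traditional_rc(a, b, c) / 9
           + 8 * linear_rc(a, b, c, ref_moba, ref_mtba) / 9)

print(direct, blended)   # both approximately 94.44
```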

Tuesday, February 18, 2020

Tripod: Linear Weights

See the first paragraph of this post for an explanation of this series.

I certainly am no expert on Linear Weight formulas and their construction; I leave that to people like Tango Tiger and Mickey Lichtman. However, I do have some knowledge of LW methods and thought I would explain some of the different methods of generating LW that are in use.

One thing to note before we start is that every RC method is LW. If you use the +1 technique, you can see the LWs that are used in a method like RC, BsR, or RPA. A good way to test non-linear RC formulas is to see how they stack up against LW methods in the context the LW are for. LW will vary widely based on the context. In normal ML contexts, though, the absolute out value is close to -.1, and the HR value stays close to 1.4. David Smyth provided the theory(or fact, I guess you could say) that as the OBA moves towards 1, the LW values of all events converge towards 1.

Now what I understand of how LW are generated:

Empirical LW

Empirical LW have been published by Pete Palmer and Mickey Lichtman. They can be considered the true Linear Weight values. Empirical LW are based on finding the value of each event with the base/out table, and then averaging the value for all singles, etc. This is the LW for the single. Another way to look at it is that they calculate the value of an event in all 24 base/out situations, and then multiply that by the proportion of that event that occurs in that situation, and then sum those 24 values.

Palmer's weights were actually based on simulation, but as long as the simulation was well-designed it shouldn't be an issue. One way you could empirically derive different LW is to assume that the events occur randomly, i.e. assuming that the proportion of overall PAs in each base/out situation is the same as the proportion of the event that occurs in this situation. For instance, if 2% of PA come with the bases loaded and 1 out, then you assume that 2% of doubles occur with the bases loaded and 1 out as well. This is an interesting idea for a method. If you see a double hit in a random situation, you could make the argument that this method would give you the best guess weight for this event. But that is only if you assume that the base/out situation does not affect the probability of a given event. Does it work out that way?

Tango Tiger told me that the only event that comes up with a significantly different LW value by the method I have just described is the walk. This is another way of saying that walks tend to occur in lower leverage situations than most events. But the difference is not that large.

Modeling

You can also use mathematical modeling to come up with LW. Tango Tiger and David Smyth have both published methods on FanHome.com that approach the problem from this direction. Both are approximations and are based on some assumptions that will vary slightly in different contexts. Tango, though, has apparently developed a new method that gives an accurate base/out table and LW based on mathematical modeling and does it quite well.

The original methods published by the two are very user-friendly and can be done quickly. Smyth also published a Quick and Dirty LW method that works well in normal scoring contexts and only uses the number of runs/game to estimate the value of events.

Skeletons

Another way to do this is to develop a skeleton that shows the relationships between the events, and then find a multiplier to equate this to the actual runs scored. The advantage of this method is that you can focus on the long-term relationships between walks v. singles, doubles v. triples, etc, and then find a custom multiplier each season, by dividing runs by the result of the skeleton for the entity(league, team, etc.) you are interested in. Recently, I decided to take a skeleton approach to a LW method. Working with data for all teams, 1951-1998, I found that this skeleton worked well: TB+.5H+W-.3(AB-H), with a required multiplier of .324. Working SB and CS into the formula, I had: TB+.5H+W-.3(AB-H)+.7SB-CS, with an outward multiplier of .322. When I took a step back and looked at what I had done though, I realized I had reproduced Paul Johnson's Estimated Runs Produced method. If you look at Johnson's method:

(2*(TB+W)+H-.605*(AB-H))*.16

If you multiply my formula by 2, you get:

(2*(TB+W)+H-.6*(AB-H))*.162

As you can see, ERP is pretty much equal to my unnamed formula. Since it is so similar to ERP, I just will consider it to be ERP. You can then find the resulting LW by expanding the formula; for example, a double adds 2 total bases and 1 hit, so it has a value of (2*2+1)*.162=.81.

Working out the full expansion of my ERP equations, we have:

ERP = .49S+.81D+1.13T+1.46HR+.32W-.097(AB-H)
ERP = .48S+.81D+1.13T+1.45HR+.32W+.23SB-.32CS-.097(AB-H)
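The expansion from skeleton to linear weights is mechanical: work out what each event contributes inside the skeleton, then apply the multiplier. A minimal sketch for the SB/CS version, with names of my own; the custom_multiplier helper is simply the "divide runs by the skeleton result" step described above.

```python
# What each event contributes to TB + .5H + W - .3(AB-H) + .7SB - CS.
SKELETON = {
    "S": 1 + 0.5,    # one total base plus half a hit
    "D": 2 + 0.5,
    "T": 3 + 0.5,
    "HR": 4 + 0.5,
    "W": 1.0,
    "SB": 0.7,
    "CS": -1.0,
    "OUT": -0.3,     # AB - H
}

def expand_skeleton(multiplier):
    """Linear weights implied by the skeleton at a given multiplier."""
    return {event: value * multiplier for event, value in SKELETON.items()}

def custom_multiplier(actual_runs, skeleton_total):
    """Multiplier that makes the skeleton equal actual runs for an entity."""
    return actual_runs / skeleton_total

weights = expand_skeleton(0.322)
for event, value in weights.items():
    print(event, round(value, 3))
```

At .322 this reproduces the .483S, .805D, 1.127T, 1.449HR values that appear, rounded to two places, in the second equation above.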

I have recently thrown together a couple of versions that encompass all of the official offensive stats:

ERP = (TB+.5H+W+HB-.5IW+.3SH+.7(SF+SB)-CS-.7DP-.3(AB-H))*.322
ERP = (TB+.5H+W+HB-.5IW+.3SH+.7(SF+SB)-CS-.7DP-.292(AB-H)-.031K)*.322

Or:

ERP = .483S+.805D+1.127T+1.449HR+.322(W+HB)-.161IW+.225(SB+SF-DP)+.097*SH-.322CS-.097(AB-H)
ERP = .483S+.805D+1.127T+1.449HR+.322(W+HB)-.161IW+.225(SB+SF-DP)+.097*SH-.322CS-.094(AB-H-K)-.104K

Here are a couple versions you can use for past eras of baseball. For the lively ball era, the basic skeleton of (TB+.5H+W-.3(AB-H)) works fine, just use a multiplier of .33 for the 1940s and .34 for the 1920s and 30s. For the dead ball era, you can use a skeleton of (TB+.5(H+SB)+W-.3(AB-H)) with a multiplier of .341 for the 1910s and .371 for 1901-1909. Past that, you're on your own. While breaking it down by decade is not exactly optimal, it is an easy way to group them. The formulas are reasonably accurate in the dead ball era, but not nearly as much as they are in the lively ball era.

Regression

Using the statistical method of multiple regression, you can find the most accurate linear weights possible for your dataset and inputs. However, when you base a method on regression, you often lose the theoretical accuracy of the method, since there is a relationship or correlation between various stats, like homers and strikeouts. Therefore, since teams that hit lots of homers usually strike out more than the average team, strikeouts may be evaluated as less negative than other outs by the formula, while they should have a slightly larger negative impact. Also, since there is no statistic available to measure baserunning skills, outside of SB, CS, and triples(for instance we don't know how many times a team gets 2 bases on a single), these statistics can have inflated value in a regression equation because of their relationship with speed. Another concern that some people have with regression equations is that they are based on teams, and they should not be applied to individuals. Anyway, if done properly, a regression equation can be a useful method for evaluating runs created. In their fine book, Curve Ball, Jim Albert and Jay Bennett published a regression equation for runs. They based it on runs/game, but I went ahead and calculated the long term absolute out value. With this modification, their formula is:

R = .52S+.66D+1.17T+1.49HR+.35W+.19SB-.11CS-.094(AB-H)

A discussion last summer on FanHome was very useful in providing some additional ideas about regression approaches(thanks to Alan Jordan especially). You can get very different coefficients for each event based on how you group them. For instance, I did a regression on all teams 1980-2003 using S, D, T, HR, W, SB, CS, and AB-H, and another regression using H, TB, W, SB, CS, and AB-H. Here are the results:

R = .52S+.74D+.95T+1.48HR+.33W+.24SB-.26CS-.104(AB-H)

The value for the triple is significantly lower than we would expect. But with the other grouping of inputs(H and TB), we get:

R = .18H+.31TB+.34W+.22SB-.25CS-.103(AB-H)

which is equivalent to:

R = .49S+.80D+1.11T+1.42HR+.34W+.22SB-.25CS-.103(AB-H)

which are values more in line with what we would expect. So the way you group events can make a large difference in the resulting formulas. This can also be seen with things like taking HB and W together or separately. Or, if there was a set relationship you wanted to impose(like CS being twice as bad as SB are good), you could use a category like SB-2CS and regress against that.
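Converting the grouped H/TB equation back to per-hit weights is simple arithmetic: every hit earns the H coefficient once and the TB coefficient once per base. A short sketch, with variable names of my own:

```python
# Coefficients from the H/TB regression quoted above.
H_COEF = 0.18
TB_COEF = 0.31

BASES = {"S": 1, "D": 2, "T": 3, "HR": 4}

# Each hit type is one hit plus its number of total bases.
per_hit_weights = {
    hit: H_COEF + bases * TB_COEF for hit, bases in BASES.items()
}

for hit, weight in per_hit_weights.items():
    print(hit, round(weight, 2))
```

Note that this grouping forces each extra base to be worth exactly .31 runs, which is why the triple cannot drift out of line the way it does when each hit type gets its own coefficient.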

An example I posted on FanHome drives home the potential pitfalls in regression. I ran a few regression equations for individual 8 team leagues and found this one from the 1961 NL:

R = 0.669 S + 0.661 D - 1.28 T + 1.05 HR + 0.352 W - 0.0944 (AB-H)

Obviously an 8 team league is too small for a self-respecting statistician to use, but it serves the purpose here. A double is worth about the same as a single, and a triple is worth NEGATIVE runs. Why is this? Because the regression process does not know anything about baseball. It just looks at various correlations. In the 1961 NL, triples were correlated with runs at r=-.567. The Pirates led the league in triples but were 6th in runs. The Cubs were 2nd in T but 7th in runs. The Cards tied for 2nd in T but were 5th in runs. The Phillies were 4th in triples but last in runs. The Giants were last in the league in triples but led the league in runs. If you too knew nothing about baseball, you too could easily conclude that triples were a detriment to scoring runs.

While it is possible that people who hit triples were rarely driven in that year, it's fairly certain an empirical LW analysis from the PBP data would show a triple is worth somewhere around 1-1.15 runs as always. Even if such an effect did exist, there is likely far too much noise in the regression to use it to find such effects.

Trial and Error

This is not so much its own method as a combination of all of the others. Jim Furtado, in developing Extrapolated Runs, used Paul Johnson's ERP, regression, and some trial and error to find a method with the best accuracy. However, some of the weights look silly, like the fact that a double is only worth .22 more runs than a single. ERP gives .32, and Palmer's Batting Runs gives .31. So, in trying to find the highest accuracy, it seems as if the trial and error approach compromises theoretical accuracy, much as regression does.

Skeleton approaches, of course, use trial and error in many cases in developing the skeletons. The ERP formulas I publish here certainly used a healthy dose of trial and error.

The +1 Method/Partial Derivatives

Using a non-linear RC formula, you add one of each event and see what the difference in estimated runs would be. This will only give you accurate weights if you have a good method like BsR, but if you use a flawed method like RC, take the custom LWs with a grain of salt or three.

Using calculus, and taking the partial derivative of runs with respect to a given event, you can determine the precise LW values of each event according to a non-linear run estimator. See my BsR article for some examples of this technique.
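To show how closely the two techniques agree, here is a sketch that values a single under basic RC both ways. The team totals are hypothetical, chosen only for illustration, and basic RC is used purely because its formula is short; the caveat above about RC's custom weights still applies.

```python
def basic_rc(h, w, tb, ab):
    """Basic Runs Created: (H + W) * TB / (AB + W)."""
    return (h + w) * tb / (ab + w)

def plus_one_single(h, w, tb, ab):
    """+1 method: add one single (one H, one TB, one AB) and re-estimate."""
    return basic_rc(h + 1, w, tb + 1, ab + 1) - basic_rc(h, w, tb, ab)

def derivative_single(h, w, tb, ab):
    """Partial derivative with respect to singles:
    MTBA*a + MOBA*b - MOBA*MTBA*c with a = b = c = 1."""
    moba = (h + w) / (ab + w)
    mtba = tb / (ab + w)
    return mtba + moba - moba * mtba

# Hypothetical team totals, for illustration only.
team = dict(h=1400, w=500, tb=2200, ab=5500)

by_plus_one = plus_one_single(**team)
by_derivative = derivative_single(**team)

print(by_plus_one, by_derivative)
```

At team-season scale the two answers differ by less than a thousandth of a run; the gap grows as the totals shrink, since adding one event to a small sample changes the context noticeably.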

Calculating the Out Value

You can calculate a custom out value for whatever entity you are looking at. There are three possible baselines: absolute runs, runs above average, and runs above replacement. The first step in finding the out value for any of these is to find the total value, from the formula, of all the events other than AB-H. Call this value X. AB-H is called O, for outs; O could also include other out events(like CS) whose value you want to vary, but in my ERP formula it is just AB-H. Then, with actual runs being R, the necessary formulas are:

Absolute out value = (R-X)/O

Average out value = -X/O

For the replacement out value, there is another consideration. First you have to choose how you define replacement level, and calculate the number of runs your entity would score, given the same number of outs, but replacement level production. I set replacement level as 1 run below the entity's average, so I find the runs/out for a team 1 run/game below average, and multiply this by the entity's outs. This is Replacement Runs, or RR. Then you have:

Replacement out value = (R-RR-X)/O
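The three formulas can be put into one function. This is a sketch under stated assumptions: the input totals are hypothetical, and the replacement step reflects my reading of the description above, taking the entity's own runs/game, subtracting one run, and converting to runs/out with the entity's own outs/game.

```python
def out_values(runs, x, outs, games):
    """Return (absolute, average, replacement) out values.

    runs  -- actual runs scored (R)
    x     -- formula value of all events other than outs (X)
    outs  -- AB - H (O)
    games -- games played, used only for the replacement baseline
    """
    absolute = (runs - x) / outs
    average = -x / outs

    # Assumed replacement level: 1 run/game below the entity's own rate.
    repl_runs_per_out = (runs / games - 1) / (outs / games)
    replacement_runs = repl_runs_per_out * outs        # RR
    replacement = (runs - replacement_runs - x) / outs

    return absolute, average, replacement

# Hypothetical team-season totals, for illustration only.
absolute, average, replacement = out_values(
    runs=750, x=1150, outs=4100, games=162
)
print(absolute, average, replacement)
```

The ordering is what you would expect: the out is least costly against the absolute baseline and most costly against average, with replacement in between.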

Monday, February 17, 2020

All I Have to Say About the Astros

I have always been completely unable to relate to people who are so weak-minded that they demand that history literally be rewritten to fit their own value judgments. If you want to judge/discount the accomplishments of the Astros, you have complete freedom of conscience to do so. Why do you need some authority figure to tell you how to think?

Of course, most of the people who demand asterisks and vacated games and forfeits and all of the other Stalinist trappings of the NCAA, IOC, and other contemptible organizations are already quite busy making their own value judgments about every damn thing in the entire world, thank you very much. If they just wanted some authority to tell them what to think, they would be sad, pathetic little creatures, worthy of the pity of free-thinking individuals and nothing more. But that's not what they want - they want some authority to tell me what to think. They seek to shift the burden of proof, as it were, from those who would deny the objective facts of reality to those who would uphold them.

I didn't call it "Stalinist" lightly - in a different cultural environment, the illiberal nature of the entire endeavor would be breathtaking in its audacity and its chutzpah. Instead, it is just another day in a world of creeping totalitarianism where the acceptable avenues of thought are controlled by the armed guards of some authority or the other. Historians of the future will learn much more about the America of 2020 from the response to the Astros scandal than they could ever hope to glean from the fact that it happened.

Thursday, February 13, 2020

Tripod: Clay Davenport's Equivalent Runs

See the first paragraph of this post for an explanation of this series. The content of this article is also the topic of better, more recent posts.

Equivalent Runs and Equivalent Average are offensive evaluation methods published by Clay Davenport of Baseball Prospectus. Equivalent Runs(EQR) is an estimator of runs created. Equivalent Average(EQA) is the rate stat companion. It is EQR/out transposed onto a batting average scale.

There seems to be a lot of misunderstanding about the EQR/EQA system. Although I am not the inventor of the system and don't claim to speak for Davenport, I can address some of the questions I have seen raised as an objective observer. The first thing to get out of the way is how Davenport adjusts his stats. Using Davenport Translations, or DTs, he converts everyone in organized baseball's stats to a common major league. All I know about DTs is that Davenport says that the player retains his value(EQA) after translating his raw stats (except, of course, that minor league stats are converted to Major League equivalents).

But the DTs are not the topic here; we want to know how the EQR formula works. So here are Clay's formulas, as given in the 1999 BP:

RAW = (H+TB+SB+1.5W)/(AB+W+CS+.33SB)
EQR(absolute) = (RAW/LgRAW)^2*PA*LgR/PA
EQR(marginal) = (2*RAW/LgRAW-1)*PA*LgR/PA
EQA =(.2*EQR/(AB-H+CS))^.4

where PA is AB+W
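For readers who want to experiment, the four formulas translate directly into code. This is only a sketch: the team totals are hypothetical, the function names are mine, and the default league values are the 1980-2000 long-term averages quoted in this article(LgRAW of .746 and LgR/PA of .121) rather than any single season's figures.

```python
LG_RAW = 0.746        # long-term league RAW, 1980-2000
LG_R_PER_PA = 0.121   # long-term league runs per PA, 1980-2000

def raw(h, tb, sb, w, ab, cs):
    """RAW = (H+TB+SB+1.5W) / (AB+W+CS+.33SB)."""
    return (h + tb + sb + 1.5 * w) / (ab + w + cs + 0.33 * sb)

def eqr_absolute(raw_value, pa, lg_raw=LG_RAW, lg_r_per_pa=LG_R_PER_PA):
    """Team version: squared relationship between RAW and runs."""
    return (raw_value / lg_raw) ** 2 * pa * lg_r_per_pa

def eqr_marginal(raw_value, pa, lg_raw=LG_RAW, lg_r_per_pa=LG_R_PER_PA):
    """Individual version: linear relationship between RAW and runs."""
    return (2 * raw_value / lg_raw - 1) * pa * lg_r_per_pa

def eqa(eqr, ab, h, cs):
    """EQA = (.2 * EQR / (AB-H+CS)) ^ .4."""
    return (0.2 * eqr / (ab - h + cs)) ** 0.4

# Hypothetical team totals, for illustration only.
h, tb, sb, w, ab, cs = 1400, 2200, 100, 500, 5500, 50
pa = ab + w

team_raw = raw(h, tb, sb, w, ab, cs)
team_eqr = eqr_absolute(team_raw, pa)
team_eqa = eqa(team_eqr, ab, h, cs)
print(team_raw, team_eqr, team_eqa)
```

One property worth noticing: at exactly league-average RAW the absolute and marginal formulas both return PA*LgR/PA, and the marginal formula is the tangent line to the absolute one at that point.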

When I refer to various figures here, like what the league RAW was or what the RMSE of a formula was, it is based on data for all teams 1980-2000. Now, RAW is the basis of the whole method. It has a good correlation with runs scored, and is an odd formula that Davenport has said is based on what worked rather than on a theory.

Both the absolute and marginal EQR formulas lay out a relationship between RAW and runs. The absolute formula is designed to work for teams, where their offensive interaction compounds and increases scoring(thus the exponential function). The marginal formula is designed to estimate how much a player has added to the league(and is basically linear). Both formulas though, try to relate the Adjusted RAW(ARAW,RAW/LgRAW) to the Adjusted Runs/PA(aR/PA). This brings in one of the most misunderstood issues in EQR.

Many people have said that Davenport "cheated" by including LgRAW and LgR/PA in his formula. By doing this, they say, you reduce the potential error of the formula by honing it in to the league values, whereas a formula like Runs Created is estimating runs from scratch, without any knowledge of anything other than the team's basic stats. This is true to some extent, that if you are doing an accuracy test, EQR has an unfair advantage. But every formula was developed with empirical data as a guide, so they all have a built in consideration. To put EQR on a level playing field, just take a long term average for LgRAW and LgR/PA and plug that into the formula. For the 1980-2000 period we are testing, the LgRAW is .746 and the LgR/PA is .121. If we use these as constants, the accuracy test will be fair.
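As a minimal sketch, here are the four formulas above with the long-term constants plugged in. The function names are mine, not Davenport's, and the sample team line is made up purely for illustration.

```python
# Long-term league constants for 1980-2000, as quoted in the text.
LG_RAW = 0.746
LG_R_PER_PA = 0.121


def raw(ab, h, tb, w, sb, cs):
    """RAW = (H+TB+SB+1.5W)/(AB+W+CS+.33SB)."""
    return (h + tb + sb + 1.5 * w) / (ab + w + cs + 0.33 * sb)


def eqr_absolute(raw_value, pa, lg_raw=LG_RAW, lg_r_per_pa=LG_R_PER_PA):
    """Absolute EQR: adjusted RAW squared, times PA, times league R/PA."""
    return (raw_value / lg_raw) ** 2 * pa * lg_r_per_pa


def eqr_marginal(raw_value, pa, lg_raw=LG_RAW, lg_r_per_pa=LG_R_PER_PA):
    """Marginal EQR: (2*ARAW - 1), times PA, times league R/PA."""
    return (2 * raw_value / lg_raw - 1) * pa * lg_r_per_pa


def eqa(eqr, ab, h, cs):
    """EQA = (.2*EQR/(AB-H+CS))^.4."""
    return (0.2 * eqr / (ab - h + cs)) ** 0.4


if __name__ == "__main__":
    # Hypothetical team line (not real data).
    ab, h, tb, w, sb, cs = 5500, 1450, 2230, 550, 100, 50
    pa = ab + w  # PA is AB+W, as defined in the text
    team_raw = raw(ab, h, tb, w, sb, cs)
    print("RAW:", round(team_raw, 3))
    print("EQR (absolute):", round(eqr_absolute(team_raw, pa), 1))
    print("EQR (marginal):", round(eqr_marginal(team_raw, pa), 1))
    print("EQA:", round(eqa(eqr_marginal(team_raw, pa), ab, h, cs), 3))
```

A team with exactly the league RAW gets PA*LgR/PA runs from either version, which is a quick sanity check on the constants.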

One of the largest(and most widely read) errors in this area is an accuracy test written up by Jim Furtado in the 1999 Big Bad Baseball Annual. Furtado tests EQR in both the ways prescribed by Davenport and the way he converts all rate stats to runs. Furtado takes RAW/LgRAW*LgR/O*O. He also does this for OPS, Total Average, and the like. Davenport railed against this test in the 2000 BP, and he was right to do so. First of all, most stats will have better accuracy if the comparison is based on R/PA, which is why Davenport uses R/PA in his EQR statistic in the first place. In all fairness to Furtado, though, he was just following the precedent set by Pete Palmer in The Hidden Game of Baseball, where he based the conversion of rate stats on innings batted, essentially outs/3. Unfortunately, Furtado did not emulate a good part of Palmer's test. Palmer used this equation to relate rate stats to runs:

Runs = (m*X/LgX+b)*IB*LgR/IB

Where X is the rate stat in question and IB is Innings Batted. m and b are, respectively, the slope and intercept of a linear regression relating the adjusted rate stat to the adjusted scoring rate. This is exactly what Davenport did; he uses m=2 and b=-1. Why is this necessary? Because the relationship between RAW and runs is not 1:1. For most stats the relationship isn't; OBA*SLG is the only one really, and that is the reason why it scores so high in the Furtado study. So Furtado finds RAW as worse than Slugging Average just because of this issue. The whole study is a joke, really-he finds OPS worse than SLG too! However, when EQR's accuracy comes up, people will invariably say, "Furtado found that..." It doesn't matter-the study is useless.
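To make the slope and intercept issue concrete, here is a small sketch of Palmer's conversion. The function name and the sample numbers are hypothetical; the opportunity unit can be innings batted (Palmer), PA (Davenport), or outs (Furtado).

```python
def runs_from_rate(x, lg_x, opps, lg_runs_per_opp, m=1.0, b=0.0):
    """Runs = (m*X/LgX + b) * opportunities * league runs per opportunity.

    m=1, b=0 is the straight 1:1 conversion described for Furtado's test;
    m=2, b=-1 is the pair Davenport uses for Marginal EQR.
    """
    return (m * x / lg_x + b) * opps * lg_runs_per_opp


if __name__ == "__main__":
    # Hypothetical team whose rate stat is 10% better than the league's.
    x, lg_x = 1.10, 1.00
    opps, lg_rate = 6000, 0.121
    print("1:1 conversion:", runs_from_rate(x, lg_x, opps, lg_rate))
    print("m=2, b=-1:     ", runs_from_rate(x, lg_x, opps, lg_rate, m=2, b=-1))
```

With m=2 and b=-1, a stat 10% above the league translates into a scoring rate 20% above the league; the 1:1 conversion gives only 10%, which is why it shortchanges any stat whose true slope is well above 1.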

Now let's move on to a discussion of the Absolute EQR formula. It states that ARAW^2 = aR/PA, and uses this fact to estimate runs. How well does it estimate runs? In the period we are studying, RMSE = 23.80. For comparison, RC comes in at 24.80 and BsR is at 22.65. One thing that is suspicious about the formula is that the exponent is the simple 2. Could we get better results with a different exponent? We can determine the perfect exponent for a team by taking (log aR/PA)/(log ARAW). The median value for our teams is 1.91, and plugging that in gives a RMSE of 23.25.
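The "perfect exponent" calculation can be sketched as follows. The team values are invented for illustration, not taken from the 1980-2000 sample, and the function name is mine.

```python
import math
from statistics import median


def perfect_exponent(araw, ar_per_pa):
    """Exponent e for which araw ** e == ar_per_pa.

    Undefined for a team with araw exactly 1, since log(1) is 0.
    """
    return math.log(ar_per_pa) / math.log(araw)


if __name__ == "__main__":
    # Hypothetical (ARAW, aR/PA) pairs, one per team.
    teams = [(1.05, 1.10), (0.96, 0.92), (1.02, 1.04), (0.93, 0.88)]
    exponents = [perfect_exponent(a, r) for a, r in teams]
    print("median exponent:", round(median(exponents), 2))
```

The median is used rather than the mean because teams very close to the league average (ARAW near 1) can produce wild exponents.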

In the BsR article, I describe how you can find linear values for a non-linear formula. Using the long term stats we used in the BsR article(1946-1995), this is the resulting equation for Absolute EQR:
.52S+.83D+1.14T+1.46HR+.36W+.24SB-.23CS-.113(AB-H)

Those weights are fairly reasonable, but unfortunately, the Absolute EQR formula isn't. We can demonstrate using BsR that as the OBA approaches 1, the run value of the offensive events converge around 1. We can see the flaw in Absolute EQR by finding the LW for Babe Ruth's best season, 1920:

EVENT   BsR    EQR
S       .68    .74
D       1.00   1.28
T       1.32   1.82
HR      1.40   2.36
W       .52    .47
O       -.22   -.33
SB      .24    .31
CS      -.52   -.68

As you can see, absolute EQR overestimates the benefit of positive events and the cost of negative events. The reason for this is that the compounding effect in EQR is wrong. When a team has a lot of HR, it also means that runners are taken off base, reducing the potential impact of singles, etc. that follow. The Absolute EQR seems to assume that once a runner gets on base, he stays there for a while-thus the high value for the HR. Besides, the Absolute EQR formula is supposed to work better for teams, but the Marginal EQR formula has a RMSE of 23.23, better than Absolute EQR. So the entire Absolute EQR formula should be scrapped(incidentally, I haven't seen it in print since 1999, so it may have been).

The Marginal formula can also be improved. If we run a linear regression of ARAW to predict aR/PA for our sample, we get:

EQR=(1.9*ARAW-.9)*PA*LgR/PA, which improves the RMSE to 22.89.

Some misunderstanding has also been perpetuated about the linearity of Marginal EQR. Basically, Marginal EQR is technically not linear but it is very close to it. If the denominator for RAW was just PA, it would be linear because it would cancel out with the multiplication by PA. But since SB and CS are also included in the denominator, it isn't quite linear. However, since most players don't have high SB or CS totals, the difference is hard to see. So Marginal EQR is essentially linear. Some, myself included, would consider it a flaw to include SB and CS in the denominator. It would have been better, for linearity's sake, to put just PA in the denominator and everything else in the numerator. But Davenport apparently was looking to maximize accuracy, and it may be the best way to go for his goals. One possible solution would be to use the RAW denominator as the multiplier in place of PA, and multiply this by LgR/Denominator. However, I tried this, and the RMSE was 23.04. I'll publish the formula here: EQR = (1.92*RAW/LgRAW-.92)*(AB+W+CS+.33SB)*LgR/(AB+W+CS+.33SB)

Now, back to the material at hand, Davenport's EQR. If we find the linear weights for the marginal equation we get:

.52S +.84D+1.16T+1.48HR+.36W+.24SB-.23CS-.117(AB-H)

As was the case with the Absolute formula, I generated these weights through Davenport's actual formula, not my proposed modification using 1.9 and .9 rather than 2 and 1 for the slope and intercept. I wondered what difference this would make if any, so I tried it with my formula:

.50S+.80D+1.11T+1.41HR+.35W+.23SB-.22CS-.105(AB-H)

These values seem to be more in line with the "accepted" LW formulas. However, EQR does not seem to properly penalize the CS-it should be more harmful than the SB is helpful.
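Linear weights like the ones above can be approximated numerically. The sketch below assumes the approach is the "+1 method": add one of each event to a baseline and see how much the estimate changes. The baseline team is hypothetical and the league constants are the 1980-2000 values rather than the 1946-1995 stats used for the published weights, so the output will be close to, but not identical with, the figures in the text.

```python
LG_RAW = 0.746       # 1980-2000 constant from the text
LG_R_PER_PA = 0.121  # 1980-2000 constant from the text


def marginal_eqr(s):
    """Davenport's Marginal EQR from a dict of event counts (O = AB-H)."""
    h = s["S"] + s["D"] + s["T"] + s["HR"]
    tb = s["S"] + 2 * s["D"] + 3 * s["T"] + 4 * s["HR"]
    ab = h + s["O"]
    pa = ab + s["W"]
    raw = (h + tb + s["SB"] + 1.5 * s["W"]) / (ab + s["W"] + s["CS"] + 0.33 * s["SB"])
    return (2 * raw / LG_RAW - 1) * pa * LG_R_PER_PA


def plus_one_weights(estimator, baseline):
    """Change in estimated runs from adding one of each event to the baseline."""
    base_runs = estimator(baseline)
    weights = {}
    for event in baseline:
        bumped = dict(baseline)
        bumped[event] += 1
        weights[event] = estimator(bumped) - base_runs
    return weights


# Hypothetical baseline team (not real data).
baseline = {"S": 1000, "D": 270, "T": 30, "HR": 150,
            "W": 550, "SB": 100, "CS": 50, "O": 4050}
weights = plus_one_weights(marginal_eqr, baseline)

if __name__ == "__main__":
    for event, value in weights.items():
        print(f"{event:>2} {value:+.3f}")
```

Any estimator that takes the same dict can be passed in, so the same loop works for a non-linear formula such as the Absolute EQR.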

Finally, we are ready to discuss EQA. Most of the complaints about EQA are along the lines of taking an important value, like runs/out, and putting it on a scale(BA), which has no organic meaning. Also mentioned is that it dumbs people down. In trying to reach out to non-sabermetricians and give them standards that they understand easily, you fail to educate them about what is really important. Both of these arguments have merit. But ultimately, it is the inventor's call. You can convert between EQA and R/O, so if you don't like how Clay publishes it, you can convert it to R/O yourself. R/O = EQA^2.5*5.

Personally, I don't like EQA because it distorts the relationship between players:

PLAYER  R/O   EQA
A       .2    .276
B       .3    .325

Player B has a R/O 1.5x that of player A, but his EQA is only 1.18x player A's (the 2.5th root of 1.5).
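The conversion and the compression it causes are easy to check. This is just the R/O = EQA^2.5*5 relationship written both ways; the function names are mine.

```python
def r_per_out_from_eqa(eqa):
    """R/O = EQA^2.5 * 5."""
    return eqa ** 2.5 * 5


def eqa_from_r_per_out(r_per_out):
    """Inverse of the above: EQA = (R/O / 5)^.4."""
    return (r_per_out / 5) ** 0.4


if __name__ == "__main__":
    a = eqa_from_r_per_out(0.2)
    b = eqa_from_r_per_out(0.3)
    print("Player A EQA:", round(a, 3))
    print("Player B EQA:", round(b, 3))
    print("R/O ratio: 1.5, EQA ratio:", round(b / a, 2))
```

Because of the .4 exponent, any ratio between two players' R/O figures shrinks to its 2.5th root on the EQA scale.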

But again, this is a quick thing you can change if you so desire, so I think it is wrong to criticize Davenport for his scale because it is his method.

Monday, February 10, 2020

Tripod: Appraised Runs

See the first paragraph of this post for an explanation of this series. The content of this article is also the topic of the more recent post here.

Mike Gimbel's stat, Run Production Average, is a very unique look at runs created, that, though published almost a decade ago, has gotten very little attention from other sabermetricians. RPA uses an initial RC formula based on what Gimbel calls Run Driving values, that underweight walks and overrate extra base hits. But Gimbel accounts for this with a set-up rating which evaluates the added impact of extra base hits in removing baserunners for the following batters. Gimbel's method has tested accuracy, with real teams, very similar to that of Runs Created, Base Runs, and Linear Weights. That it does not hold up at the extremes like Base Runs prevents it from being the best structure for RC we have, but it is an interesting alternative to RC. This is a compilation of some posts from FanHome on my knockoff of RPA, Appraised Runs. RPA uses some categories, like balks and wild pitches, that we do not have readily available. So I rigged up, following Gimbel's example, a similar formula using the typical categories. In doing so, I probably lost some of the true nature of Gimbel's creation. Gimbel obviously is the expert on his own stat, but hopefully AR is not too flawed to be useful in looking at the concept of RPA. This is Gimbel's RPA article.

Here is a compilation of posts from a thread on AR on FanHome. You can see the errors I made the first time, although I am not sure that the second version is much of an improvement.

Patriot - Dec 31, 2000

Mike Gimbel has a stat called Run Production Average. It is basically a R:PA method, but the way he gets runs is unlike any other construct I've seen. He starts by using a set of LW that reflect advancement values(or Run Driving), not total run scoring like most LW formulas. Then he adjusts half of this for the batter's Set Up ability, representing runners on base for the following batters. It is an interesting concept, but his formula has all sorts of variables that aren't available, like ROE, Balks, and WPs. So I tried to replicate his work.

As a starting point I used the Runs Produced formula laid out by Steve Mann in the 1994 Mann Fantasy Baseball Guide. The weights are a little high compared to other LW formulas, but oh well:

RP=.575S+.805D+1.035T+1.265HR+.345W-.115(AB-H)

Working with this formula, I saw that Gimbel's weights were similar to the (event value-walk value), and that Gimbel's walk value was similar to the (RP run value/2). The HR value seems to be kept. This gives Run Driving, RD, as: .23S+.46D+.69T+1.265HR+.138W

The set-up values were similar to 1-Run Driving value, so the Set-Up Rating, which I'll call UP, is (.77S+.54D+.31T-.265HR+.862W)/(AB-H). Gimbel used (AB+W) in the denominator, but outs works better.

Then Gimbel would take UP/LgUP*RD*.5+RD*.5, thus weighting half of the RD by the adjusted UP. But I found that UP correlated better with runs scored than RD, so we get:

AR = UP/LgUP*RD*.747+RD*.390

Where AR is Appraised Runs, the name I gave to this thing. LgUP can be constant @ .325 if you like it better.

Anyway, this had an AvgE in predicting team runs of 18.72, which is a little bit better than RC. So it appears as if Gimbel's work can be taken seriously as an alternative Run Production formula, like RC, LW, or BsR.

Please note that I am not endorsing this method. I'm just playing with it.

David Smyth - Jan 1, 2001

There is no doubt that Gimbel's method was ahead of its time, and that it can, properly updated, be as accurate as any other RC method. It has a unique advantage in being equally applicable to any entity (league, team, or individual), I think.

I support your effort to work on it a bit, and get rid of the odd categories he includes.

Basically what he was saying is that part of scoring is linear, and part is not. This is in between all-linear formulas such as XR, and all non-linear ones such as RC and BsR. The new RC is 89% linear and 11% non-linear, I recall. I'm not sure what the percentage is for RPA. As a team formula, it's certainly not perfect; at theoretical extremes it will break down. The only team formula I'm aware of which doesn't have that problem is BsR. There is probably a 'compromise' between RPA and BsR which would be great. IOW, you could probably use the fixed drive-in portion from RPA, and a modification of BsR for the non-linear part.

My position on these things is that both parts of the complete method--the run part and the win part--should be consistent with each other. For example, XR is linear and XW is non-linear. BsR is non-linear and BsW is linear. That bothers me, so I've chosen to go with a linear run estimator and BsW. Linear-linear. It's not so much a question of which is 'right'; it's a question of which frame of reference is preferable. If you want an individual frame of reference, go with RC or BsR, OWP, Off. W/L record, etc. If you want a team frame of reference, go with RPA or the new RC and XW. If you want a global (league or group of leagues) frame of reference, go with an XR-type formula and BsW. IMO, global has a simplicity and elegance which is unmatchable. Global would also include the Palmer/mgl LWts, using the -.30 type out value--another excellent choice.

There are also methods with enhanced accuracy such as Value Added Runs, and Base Production (Tuttle). These methods require tons of data. It's all a question of where to draw the line between accuracy, the amount of work, and what you're trying to measure. I tend to draw the line in favor of simplicity, because I've yet to be convinced that great complexity really pays off.

Patriot - Jan 2, 2001(clipped)

Anyway, since I have it here, this is the AR stolen base version:
RD = .23S+.46D+.69T+1.265HR+.138W+.092SB
UP = (.77S+.54D+.31T-.265HR+.862W+.092SB-.173CS)/(AB-H+CS)
AR = UP/LgUP*RD*.737+RD*.381
LgUP can be held constant @ .325

Patriot - Jun 13, 2001(clipped)

I have been working with this again, not because I endorse the construct or method but because the first time I did one amazingly crappy job.

For example, Ruth in 1920 has 205 RC, 191 BsR, and 167 RP. And 248 AR! Now, we don't know for sure how many runs Ruth would have created on his own, but anything that's 21% higher than RC makes me immediately suspicious.

Anyway, the problem comes from the UP term mostly. Gimbel used AB+W as the denominator and I used AB-H. Neither of us were right. Gimbel's method doesn't give enough penalty for outs, and mine overemphasizes out making to put too much emphasis on a high OBA. The solution is to subtract .115(that is the value from RP which I based everything on) times outs from the UP numerator because every out(or at least every third out) reduces the number of runners on base to zero.

Gimbel's RD values were also meant to estimate actual runs scored. So I applied a fudge factor to my RD to make it do the same. Anyway, this is the new Appraised Runs method:

RD = .262S+.523D+.785T+1.44HR+.157W
UP = (.77S+.54D+.31T-.265HR+.862W-.115(AB-H))/(AB+W)
AR = UP/AvgUP*RD*.5+RD*.5
AvgUP can be held @ .145

This decreases the RMSE of the formula and also makes a better estimate IMO for extreme teams. Ruth now has 205 AR, more in line with the other estimators, although if you wanted to apply this method TT is the way to go.

The new AR stolen base version is:

RD = .262S+.523D+.785T+1.44HR+.157W+.079SB-.157CS

UP = (.77S+.54D+.31T-.265HR+.862W-.115(AB-H)+.262SB-CS)/(AB+W)

AR = UP/AvgUP*RD*.5+RD*.5
AvgUP can be held @ .140

Corrections - July 2002

I have had those Appraised Runs formulas for over a year now, and never bothered to check and see if they held up to the LW test. Here are the LW for AR from the +1 method for the long term ML stats(the display is S,D,T,HR,W,SB,CS,O): .52,.69,.86,1.28,.46,.19,-.57,-.106

You can see that we have some serious problems. The single, steal, and out are pegged pretty much perfectly. But extra base hits are definitely undervalued and the CS is wildly overvalued. So, I tried to revise the formula to improve these areas.

And I got nowhere. Eventually I scrapped everything I had, and went back to Gimbel's original values, and just corrected it for the fact that we didn't have some of his data. His RD portion worked fine, but I couldn't get his UP to work at all. Finally, I scrapped UP altogether. I decided instead to focus on the UP ratio(UP/AvgUP). This value is multiplied by half of the RD, and added to the other half of the RD to get AR. We'll call the UP/AvgUP ratio X. If you know RD, which I did based on Gimbel's work(I used his RD exactly except with a fudge factor to make it equate with runs scored, and dropping the events I didn't want/have), you have this equation:

R = RD*.5+RD*.5*X

Rearranging this equation to solve for X, you have:

X = R/(RD*.5)-1

So, with the actual X value for each team known, I set off to find a good way to estimate X. I didn't want to compare to the average anymore-if you think about it, it doesn't matter what the LgUP is, the number of baserunners on should depend only on the team's stats. So I did some regressions, found one that worked well, streamlined and edited the numbers, and wound up with these equations for AR:

RD1 = .289S+.408D+.697T+1.433HR+.164W

UP1 = (5.7S+8.6(D+T)+1.44HR+5W)/(AB+W)-.821

AR1 = UP1*RD1*.5 + RD1*.5

RD2 = .288S+.407D+.694T+1.428HR+.164W+.099SB-.164CS

UP2 = (5.7S+8.6(D+T)+1.44HR+5W+1.5SB-3CS)/(AB+W)-.818

AR2 = UP2*RD2*.5 + RD2*.5

These equations had RMSEs on the data for 1970-1989 of 22.64 and 21.79 respectively. For comparison, Basic RC was at 24.93 and Basic ERP was at 23.08, so the formulas are quite accurate when used for real teams. The linear values were: .51,.80,1.09,1.42,.35,.187,-.339,-.106

When applied to Babe Ruth, 1920, he had 205 AR, which is a reasonable value for an RC-like formula. Hopefully this new version of AR will turn out to be one that I can actually keep-maybe the third time is a charm.
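For anyone who wants to try the July 2002 version, here is a rough sketch of the two final AR formulas as functions. The function names are mine and the team line at the bottom is hypothetical.

```python
def appraised_runs_1(ab, s, d, t, hr, w):
    """AR without stolen bases (the RD1/UP1 version)."""
    rd = 0.289 * s + 0.408 * d + 0.697 * t + 1.433 * hr + 0.164 * w
    up = (5.7 * s + 8.6 * (d + t) + 1.44 * hr + 5 * w) / (ab + w) - 0.821
    return up * rd * 0.5 + rd * 0.5


def appraised_runs_2(ab, s, d, t, hr, w, sb, cs):
    """AR with stolen bases and caught stealing (the RD2/UP2 version)."""
    rd = (0.288 * s + 0.407 * d + 0.694 * t + 1.428 * hr + 0.164 * w
          + 0.099 * sb - 0.164 * cs)
    up = (5.7 * s + 8.6 * (d + t) + 1.44 * hr + 5 * w
          + 1.5 * sb - 3 * cs) / (ab + w) - 0.818
    return up * rd * 0.5 + rd * 0.5


if __name__ == "__main__":
    # Hypothetical team line (not real data).
    team = dict(ab=5500, s=1000, d=270, t=30, hr=150, w=550)
    print("AR1:", round(appraised_runs_1(**team), 1))
    print("AR2:", round(appraised_runs_2(**team, sb=100, cs=50), 1))
```

Note that UP here already plays the role of the old UP/AvgUP ratio, so there is no league term to supply.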

Saturday, February 08, 2020

The Beals Goes On

Greg Beals is entering his tenth year at the helm of the OSU baseball program, having somehow parlayed a second run to the Big Ten Tournament title in three years into a three year contract extension. This despite his overall record over those nine seasons being the worst for the program in thirty years. At some point, it becomes an exercise in masochism even to repeat these facts. Greg Beals is apparently the coach for life.

This year, expectations are high. Baseball America ranked OSU #24 in their preseason Top 25, with only the forces of darkness joining them from the Big Ten at #8. Their explanation: “Ohio State brings its entire rotation back from that team and has a star behind the plate in Dingler”. That rotation will be fronted by redshirt sophomore lefty Seth Lonsway, who is likely to be a high draft pick come June. His strikeouts (12.3 per nine) outshone his overall performance (a good but not great +9 RAA) but offer the promise of an ace-level breakout. Fellow soph Garrett Burhenn was just as effective in 2019 (+10 RAA), but with a much more pedestrian strikeout rate. Junior lefty Griffan Smith was above average (+3 RAA) and should be a solid #3. It’s easy to see why this rotation - all of whom made at least fifteen starts and topped ninety innings – is highlighted as a strength.

The same can not be said for the bullpen, which is filled with significant question marks after Andrew Magno’s graduation. Last year, the weekday starters were by committee; only Jake Vance, now a senior, made more than three starts, and he only logged 41 innings over his 11 appearances/9 starts, and was not effective in doing so (7.90 RA). Sophomore Will Pfennig may be used as the relief ace, but also is a potential starter as he pitched 58 innings over 24 appearances in 2019.

Grad transfer lefty Patrick Murphy pitched sparingly during his time at Marshall, and in 18 innings last year allowed 7 runs with a troubling 11/15 K/W, but Beals loves deploying lefty specialists and he may fit the bill. A couple of sophomore righties threw hard but didn’t know where it was going (Bayden Root with 11.8 K/7.2 W over 35 innings and TJ Brock with 6.7/5.8 over 31) and their lefty classmate Mitch Milheim allowed 23 runs in as many innings (Milheim is another potential starter). Senior Joe Gahm only logged 19 innings; he was effective with a 4.26 RA but his peripherals tell a different story (6.63 eRA). He is one of only four returning Buckeye pitchers who had a RA better than the conference average in 2019 – the three starters are the others, which explains my concern about the bullpen. A cadre of freshman righties (Ethan Hammerberg, Cam Hubble, Tyler Kean, Wyatt Loncar, and Yianni Skeriotis) could be in the mix, and if there’s any justice in baseball then Ethan Hammerberg is a future lockdown closer.

Junior Dillon Dingler will handle the catching, and was Baseball America’s choice as preseason Big Ten Player of the Year. He did everything at the plate but hit for power last year (.291/.391/.424). His primary backup will be junior Brent Todys, who hit well enough last year to get at bats at DH where he is also penciled in as the starter for 2020 (.256/.345/.462). The four backstops on the roster are all juniors as Dingler and Todys are joined by transfers Ronnie Allen and Archer Brookman.

Senior Conor Pohl is the incumbent at first, coming off a very consistent two year run of middling averages and power but solid walk rates (.279/.377/.393 in 2018 and .264/.350/.396 in 2019). Senior Matt Carpenter emerged from the bench as a Beals favorite as the second baseman, but his production (.257/.300/.324) left much to be desired. Sophomore Zach Dezenzo did an admirable job with an average offensive performance (.250/.316/.440) despite being stretched at shortstop due to an injury to now-senior Noah West. He will be counted on to be a middle of the order hitter for this squad. The aforementioned West is a solid fielder who was average at the plate in 86 PA before his injury, an improvement from his first two campaigns. Sophomore Nick Erwin will be a key backup; he struggled to a .235/.288/.272 line after being pressed into duty at the hot corner when Dezenzo slid over to short. Junior transfers Colton Bauer and Sam Wilson, sophomore Aaron Hughes, and freshman Avery Fisher round out the roster.

The outfield will have to be rebuilt as OSU’s top two offensive performers from 2019 (LF Brady Cherry and RF Dominic Canzone) are gone; they combined for a whopping 60 RAA. Also gone is center fielder Ridge Winand, although he will be easier to replace (-2 RAA). The only returning player with any significant experience is sophomore Nolan Clegg, who is penciled in to play right (.286/.348/.476 in 47 PA). The other spots are slated to go to freshmen Mitchell Okuley (left) and Nate Karaffa (center), but there could be opportunities for a number of other players including juniors Jake Ruby and Scottie Seymour, redshirt freshman Alec Taylor, and true freshmen Joey Aden and Caden Kaiser.

OSU will open the season next weekend against lower-tier northern teams (St. Joe’s, Pitt, and Indiana State) in Port Charlotte, FL, then go to Georgia Tech and Lipscomb for true road series before facing Stetson, Harvard, and Fairfield at neutral sites and North Florida on the road. March 13 is the home opener at Bill Davis Stadium with a weekend series against Liberty, with the succeeding weekend opponents being Rutgers, @ Indiana, MSU, @ the forces of darkness, Illinois, The Citadel, @ Nebraska, Maryland, @ Northwestern. Mid-week opponents include Wright State (away), Bowling Green, Toledo, Morehead State, Dayton, Miami, Ohio University, Cincinnati (away), and Xavier (away).

Far be it from me to question Baseball America, but this does not look anything like a top 25 national team to me. The offense is not likely to be good; only Dingler and Dezenzo figure to be well above average performers, and the entire outfield is a question mark. The starting pitching is strong, but the depth behind it and the bullpen give less reason for optimism than the outfield, where at least hope can be placed on the shoulders of freshmen. Most of the non-weekend pitchers have already struggled; while we should expect a couple to take a step forward, they aren’t a blank slate on which to project hopes and dreams.

And there’s nothing in the world of Buckeye baseball further from such a blank slate than Greg Beals. Beals is what he is – a coach running a middle-tier Big Ten program in perpetuity, lucking his way to Big Ten Tournament titles that satiate his apathetic athletic director and even occasionally fool the wise folks at Baseball America. Coach for life despite having never won a conference title – it’s good work if you can get it, but it doesn’t make for a good fan experience.



Thursday, February 06, 2020

Tripod: Run Estimators & Accuracy

See the first paragraph of this post for an explanation of this series.

This page covers some run estimators. It by no means includes all of the run estimators, of which there are dozens. I may add some more descriptions at a later time. Anyway, Base Runs and Linear Weights are the most important and relevant. Equivalent Runs is often misunderstood. Appraised Runs is my twist on the funny looking, flawed, but no more so than Runs Created method of Mike Gimbel.

I guess I'll also use this page to make some general comments about run estimators that I may expand upon in the future. I posted these comments on Primer in response to an article by Chris Dial saying that we should use RC (or at least that it was ok as an accepted standard) and in which he mentioned something or the other about it being easy to understand for the average fan:

If you want a run statistic that the general public will understand, wouldn't it be better to have one that you can explain what the structure represents?

Any baseball fan should be able to understand that runs = baserunners *% of baserunners who score + home runs. Then you can explain that baserunners and home runs are known, and that we have to estimate % who score, and the estimate we have for it may not look pretty, but it's the best we've been able to do so far, and that we are still looking for a better estimator. So, you've given them:

1. an equation that they can understand and know to be true

2. an admission that we don't know everything

3. a better estimator than RC

And I think the "average" fan would have a much easier time understanding that the average value of a single is 1/2 a run, the average value of a walk is 1/3 of a run, the average value of an out is -1/10 of a run, than that complicated, fatally flawed RC equation. But to each his own I suppose.

I will also add that the statement that "all RC methods are right" is simply false IMO. It is true that there is room for different approaches. But, for instance, RC and BsR both purport to model team runs scored in a non-linear fashion. They can't both be equally right. The real answer is that neither of them are "right"; but one is more "right" than the other, and that is clearly BsR. But which is more right, BsR or LW? Depends on what you are trying to measure.

********

When I started this page, I didn't intend to include anything about the accuracy of the various methods other than mentioning it while discussing them. A RMSE test done on a large sample of normal major league teams really does not prove much. There are other concerns which are more important IMO such as whether or not the method works at the extremes, whether or not it is equally applicable to players as teams, etc. However, I am publishing this data in response to the continuing assertion I have seen from numerous people that BsR is more accurate at the extremes but less accurate with normal teams than other methods. I don't know where this idea got started, but it is prevalent with uninformed people apparently, so I wanted to present a resource where people could go and see the data disproving this for themselves.

I used the Lahman database for all teams 1961-2002, except 1981 and 1994 for obvious reasons. I tested 10 different RC methods, with the restriction that they use only AB, H, D, T, HR, W, SB, and CS, or stats that can be derived from those. This was for three reasons: one, I personally am not particularly interested in including SH, SF, DP, etc. in RC methods if I am not going to use them on a team; two, I am lazy and that data is not available and I didn't feel like compiling it; three, some of the methods don't have published versions that include all of the categories. As it is, each method is on a fair playing field, as all of them include all of the categories allowed in this test. Here are the formulas I tested:

RC: Bill James, (H+W-CS)*(TB+.55SB)/(AB+W)

BR: Pete Palmer, .47S+.78D+1.09T+1.4HR+.33W+.3SB-.6CS-.090(AB-H)
.090 was the proper absolute out value for the teams tested

ERP: originally Paul Johnson, version used in "Linear Weights" article on this site

XR: Jim Furtado, .5S+.72D+1.04T+1.44HR+.34W+.18SB-.32CS-.096(AB-H)

EQR: Clay Davenport, as explained in "Equivalent Runs" article on this site

EQRme: my modification of EQR, using 1.9 and -.9, explained in same article
For both EQR, the LgRAW for the sample was .732 and the LgR/PA was .117--these were held constant

BsR: David Smyth, version published in "Base Runs" article on this site

UW: Phil Birnbaum, .46S+.8D+1.02T+1.4HR+.33W+.3SB-.5CS-(.687BA-1.188BA^2+.152ISO^2-1.288(WAB)(BA)-.049(BA)(ISO)+.271(BA)(ISO)(WAB)+.459WAB-.552WAB^2-.018)*(AB-H)
where WAB = W/AB

AR: based on Mike Gimbel concept, explained in "Appraised Runs" article on this site

Reg: multiple regression equation for the teams in the sample, .509S+.674D+1.167T+1.487HR+.335W+.211SB-.262CS-.0993(AB-H)
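The formulas that are written out in full above can be coded in a few lines. This is only a sketch: the function names are mine, the team line is hypothetical, and the methods that point to other articles (ERP, EQR, BsR, AR) and the UW out value are left out.

```python
def rc(ab, h, d, t, hr, w, sb, cs):
    """Bill James RC as listed: (H+W-CS)*(TB+.55SB)/(AB+W)."""
    tb = h + d + 2 * t + 3 * hr
    return (h + w - cs) * (tb + 0.55 * sb) / (ab + w)


def linear(weights, ab, h, d, t, hr, w, sb, cs):
    """Generic linear estimator; weights are ordered S, D, T, HR, W, SB, CS, out."""
    s = h - d - t - hr  # singles
    ws, wd, wt, whr, ww, wsb, wcs, wout = weights
    return (ws * s + wd * d + wt * t + whr * hr + ww * w
            + wsb * sb + wcs * cs + wout * (ab - h))


# Coefficients copied from the list above.
BR_WEIGHTS = (0.47, 0.78, 1.09, 1.4, 0.33, 0.3, -0.6, -0.090)
XR_WEIGHTS = (0.5, 0.72, 1.04, 1.44, 0.34, 0.18, -0.32, -0.096)
REG_WEIGHTS = (0.509, 0.674, 1.167, 1.487, 0.335, 0.211, -0.262, -0.0993)

if __name__ == "__main__":
    # Hypothetical team line (not real data): AB, H, D, T, HR, W, SB, CS.
    team = (5500, 1450, 270, 30, 150, 550, 100, 50)
    print("RC: ", round(rc(*team), 1))
    print("BR: ", round(linear(BR_WEIGHTS, *team), 1))
    print("XR: ", round(linear(XR_WEIGHTS, *team), 1))
    print("Reg:", round(linear(REG_WEIGHTS, *team), 1))
```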

Earlier I said that all methods were on a level playing field. This is not exactly true. EQR and BR both take into account the actual runs scored data for the sample, but only to establish constants. BsR's B component should have this advantage too, but I chose not to so that the scales would not be tipped in favor of BsR, since the whole point is to demonstrate BsR's accuracy. Also remember that the BsR equation I used is probably not the most accurate that you could design, it is one that I have used for a couple years now and am familiar with. Obviously the Regression equation has a gigantic advantage.

Anyway, what are the RMSEs for each method?

Reg-------22.56
XR--------22.77
BsR-------22.93
AR--------23.08
EQRme-----23.12
ERP-------23.15
BR--------23.29
UW--------23.34
EQR-------23.74
RC--------25.44

Again, you should not use these figures as the absolute truth, because there are many other important factors to consider when choosing a run estimator. But the important things to recognize IMO are:

* all of the legitimate published formulas have very similar accuracy with real major league teams' seasonal data

* if accuracy on team seasonal data is your only concern, throw everything away and run a regression (the reluctance of people who claim to be totally concerned about seasonal accuracy to do this IMO displays that they aren't really as stuck on seasonal team accuracy as they claim to be)

* RC is way behind the other methods, although I think if it included W in the B factor as the Tech versions do it would be right in the midst of the pack

* BsR is just as accurate with actual team seasonal data as the other run estimators

Anyway, the spreadsheet is available here, and you can plug in other methods and see how they do. But here is the evidence; let the myths die.
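If you would rather test a method in code than in the spreadsheet, the RMSE calculation is short. This sketch assumes the usual definition (square root of the mean squared error, dividing by the number of teams); the team totals shown are made up.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and estimated team runs."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must be the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


if __name__ == "__main__":
    # Hypothetical actual and estimated runs for four team-seasons.
    actual_runs = [712, 805, 650, 744]
    estimated_runs = [730, 790, 668, 735]
    print("RMSE:", round(rmse(actual_runs, estimated_runs), 2))
```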

Here are some other accuracy studies that you may want to look at. One is by John Jarvis. My only quibble with it is that he uses a regression to runs on each RC estimator, but it is a very interesting article that also applies the methods to defense, and is definitely worth reading (NOTE: sadly this link is dead).

And this is Jim Furtado's article as published in the 1999 BBBA. He uses both RMSE and regression techniques to evaluate the estimators. Just ignore his look at rate stats--it is fatally flawed by assuming there is a 1:1 relationship between rate stats and run scoring rate. That is pretty much true for OBAxSLG only and that is why it comes in so well in his survey.

Tuesday, January 28, 2020

Tripod: Common Fallacies

See the first paragraph of this post for an explanation of this series.

Here I deal with some misinformation that is sometimes spread about sabermetrics, or poorly designed statistical methods that are against sabermetric principles. The most important things to remember about sabermetrics are 1) that it is not the numbers themselves that matter, it is what the numbers mean and 2) the only thing that matters is wins, and the only things that lead to wins are runs and outs. Those two principles serve to explain most of the folly behind these fallacies.

The "Bases" Fallacy

There are many methods proposed, by many different people, that use bases and outs as the two main components. These include Boswell's Total Average, Offense Ratio, Codell's Base Out Percentage. There are others too, either looking at bases/out or bases/PA. Not all of the people who have designed these methods fall into the fallacy. Specifically I'll look at John McCarthy and his 1994 book from Betterway Books, Baseball's All-Time Dream Team.

McCarthy rates the great players of all time by what he calls the Earned Bases Average. EBA = (TB + W + SB - CS)/(AB + W). McCarthy mentions that he has read the sabermetric research, but that the sabermetric work is too difficult for the average fan to understand. He goes on to talk about how Linear Weights puts a HR as 3.15 times more valuable than a single, a triple 2.2, a double 1.7, and so on. He then says, "I believe that the value of a baseball game is more than just runs and winning. Winning is the player's aim, but there is also a transcendent beauty to great hits. It is that beauty that puts fans into the seats and visions of grandeur into kids' fantasies. A home [run] can immeasurably lift the spirits of the team, or take the wind out of opponents. So I challenge a mathematical concept which devalues the extra bases earned by sluggers and speedsters."

Now, Mr. McCarthy may indeed have a point when he speaks of "grandeur" and stuff like that. It is OK if you want to design a method to measure the grandeur of players. Just don't get that confused with what actually wins baseball games. He later explains that the estimated values are not "tangible or real", and that "they are too complicated and many times are just clearly wrong." Sorry, buddy, it is you who are clearly wrong. A baseball game is not played in a vacuum. A player must interact with his teammates. The situations created by runners and outs affect the value of offensive events. Sure, those values are not always constant. That is why you must decide what you are measuring, be it ability or value, and choose value added runs or context neutral runs. But the fact is, a home run is not four times more valuable than a single. It just isn't. And a stolen base is clearly not as valuable as a single, because it advances just one baserunner by one base, whereas a single advances the hitter by one base, and advances most runners by at least one and sometimes two bases. Plus the single gives an extra Plate Appearance to the team's offense. A stolen base does none of this.

The basic problem with McCarthy's thinking is that bases are not what matters. The game may be called baseball, but the winner is not the one with the most bases but the one with the most runs. You must relate everything to run scoring eventually if you want to really approximate its value. And TA and EBA and the like can be decent estimators of runs. But all bases are not created equal. A SB is always worth at least one base and a HR at least four. But a SB can only be worth one base and a HR can be worth as many as ten bases. The EBA concept assumes that the only bases that matter are the ones that an individual generates for himself, but again, no player is an island. Everything eventually comes down to runs and outs, not bases.
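
The "at least four, as many as ten" range for a home run can be checked by brute force. The sketch below (function names are my own) walks through all eight base states and counts the bases gained by the batter plus every runner who scores.

```python
from itertools import product

def bases_gained_on_hr(runners):
    """Total bases gained by the batter and all runners on a home run.

    runners is a tuple of three booleans: (on first, on second, on third).
    """
    batter_bases = 4
    # A runner on base b needs 4 - b more bases to reach home.
    runner_bases = sum(
        4 - base for base, occupied in zip((1, 2, 3), runners) if occupied
    )
    return batter_bases + runner_bases

def bases_gained_on_sb():
    """A single stolen base moves exactly one runner exactly one base."""
    return 1

hr_totals = [
    bases_gained_on_hr(state) for state in product((False, True), repeat=3)
]
print(min(hr_totals), max(hr_totals), bases_gained_on_sb())
```

The minimum is the bases-empty home run and the maximum is the grand slam, while the stolen base is stuck at one base no matter who else is on.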

The Right-Handed Hitter Adjustment Fallacy

This is one that you can try to sneak by people. After all, sabermetricians seemingly like to adjust for everything, whether or not it needs to be adjusted for, right? So, since there are more right-handed pitchers than southpaws, and righties hit worse versus righties, shouldn't they get credit for dealing with this disadvantage? No way, Jose.

Well, I suppose that if you want to measure literal ability, you want a right-handed adjustment. But literal ability has nothing to do with winning baseball games. It has to do with batting practice and skills competitions, and jaw dropping, but not winning. Just as, because of the dynamics of baseball, not all bases are created equal, a lefty hitter is worth more than a righty of the same literal ability, assuming the normal left/right effect holds for them both. I view this extra credit for righties as tantamount to giving credit for ability to play the banjo. I mean, if I had a clone, the same as me in every way, except he could play the banjo and I couldn't, that would make him a more interesting guy than me, no? Sure. What does playing the banjo have to do with winning baseball games? About the same amount as being right-handed.

Seriously, being a right-handed hitter in baseball is a small handicap, just as being unable to hit home runs is a handicap, and having an 85 mph fastball is not as good as a 90 mph fastball. It is a great deal like if we gave Muggsy Bogues extra credit for being 5'3". That certainly hurts his stats, so why don't we adjust for it? Because it's a fact of life that these things are disadvantages, and the goal of baseball is to win games, not to look good.

Here is an example of a biased man who manipulates the numbers in this way. Giving Jim Rice 73% of his PAs vs. lefties is stupid, because 73% of the plate appearances pitched in baseball are not by lefty pitchers.

The Fallacy of the Ecological Fallacy

From time to time, someone who has a background in formal statistics will claim that applying various measures (usually a run estimator) tested at the team level to individual players is falling prey to the Ecological Fallacy and is thus invalid.

Not having a formal statistics background, it may be hazardous to talk about something that I don’t fully understand. But I can tell you that to the extent that I understand the ecological fallacy, the idea that it applies to individual runs created estimates is hokum.

According to this link, the ecological fallacy occurs when “making an unsupported generalization from group data to individual behavior”. They then use an example of voting. One community has 25% who make over $100K a year, and 25% who vote Republican. Another has 75% who make over $100K and 75% who vote Republican. To use this data to conclude that there is a perfect correlation between individuals voting Republican and making over $100K would be the ecological fallacy. In fact, they show how the data could be distributed so that the correlation between individuals voting Republican and making over $100K is actually negative.
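
To make the voting example concrete, here is a small sketch with 100 voters per community. The split of voters within each community is my own construction, chosen only to match the 25%/25% and 75%/75% totals; it is not necessarily the distribution shown at the link. With this split the two community totals line up perfectly, yet the income/vote correlation is negative inside each community and zero when all 200 voters are pooled.

```python
import math

def correlation(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov / math.sqrt(var_x * var_y)

def community(rich_rep, rich_other, poor_rep, poor_other):
    """Return (over_100k flags, votes_republican flags), one entry per voter."""
    rich = [1] * (rich_rep + rich_other) + [0] * (poor_rep + poor_other)
    rep = [1] * rich_rep + [0] * rich_other + [1] * poor_rep + [0] * poor_other
    return rich, rep

# Assumed individual-level splits (hypothetical):
# Community A: 25% over $100K, 25% Republican, and no overlap between the two.
a_rich, a_rep = community(rich_rep=0, rich_other=25, poor_rep=25, poor_other=50)
# Community B: 75% over $100K, 75% Republican, with the smallest possible overlap.
b_rich, b_rep = community(rich_rep=50, rich_other=25, poor_rep=25, poor_other=0)

within_a = correlation(a_rich, a_rep)
within_b = correlation(b_rich, b_rep)
pooled = correlation(a_rich + b_rich, a_rep + b_rep)
print(round(within_a, 3), round(within_b, 3), round(pooled, 3))
```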

People will then go on to claim that since Runs Created methods are tested on teams, it is wrong to apply them to individuals and assume accuracy. It is true that multiplicative methods like Runs Created and Base Runs make assumptions about how runs are created that are true when applied to teams but cannot be applied to individuals(the well-documented problem of driving yourself in; Barry Bonds’ high on base factor interacts with his high advancement factor in RC, but in reality interacts with the production of his teammates). It is also true that regression equations have many potential pitfalls when applied to teams, let alone taking team regressions and applying them to individuals. However, these limitations are well known by most sabermetricians (although some stubbornly continue to use James’ RC for individual hitters).
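
The "driving yourself in" problem is easy to see with numbers. The sketch below applies the basic RC formula, (H+W)*TB/(AB+W), directly to a hitter's own line and then compares that with the change in team RC when the same hitter is added to eight teammates. Both stat lines are made-up placeholders rather than any real player or team, and the with/without comparison is only a rough stand-in for the theoretical team idea.

```python
def basic_rc(h, w, tb, ab):
    """Basic Runs Created: (H + W) * TB / (AB + W)."""
    return (h + w) * tb / (ab + w)

# Hypothetical high on-base, high-power hitter.
star = {"h": 140, "w": 150, "tb": 320, "ab": 400}
# Hypothetical ordinary teammate; eight of them fill out the lineup.
teammate = {"h": 135, "w": 50, "tb": 210, "ab": 500}

others = {stat: 8 * value for stat, value in teammate.items()}
team_with_star = {stat: others[stat] + star[stat] for stat in star}

# RC applied straight to the individual: his own on base factor
# is multiplied by his own advancement factor.
individual_rc = basic_rc(**star)

# Team RC with him minus team RC without him: his on base factor
# now interacts mostly with his teammates' advancement, and vice versa.
marginal_rc = basic_rc(**team_with_star) - basic_rc(**others)

print(round(individual_rc, 1), round(marginal_rc, 1))
```

For an extreme hitter like this one the direct figure comes out well above the with/without figure, which is the overstatement the parenthetical above describes.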

The ecological fallacy claim, though, is extended by some to every run estimator that is verified against team data. The claim is that there “need not be little to no connection between team-level functions and player-level functions”. I also saw a critic point out once that run estimators did not do a good job of predicting individual runs scored.

My retort was that the low temperature today in Mozambique did not do a good job of predicting individual runs scored either. To assume that the team runs scored function and the individual runs scored function are the same is to be ignorant of the facts of baseball. A walk and a single have an equal run-scoring value for an individual, and a home run will always have an individual run-scoring value of 1. This is not true for a team, because, except in the case of the home run, it takes another player to come along and drive his teammate in. In the team case, all of these individual stats are aggregated. The home run by one batter not only scores him, it scores any teammates on base. And therefore the act of scoring runs, for a team, incorporates advancement value as well. A single will create more runs, in average circumstances, than will a walk.

Therefore, when we have a formula that estimates runs scored for a team, it does not estimate the same function as runs scored for a player. It instead approximates another function that we choose to call “runs created” or “runs produced” or what have you. Now it could be claimed, I suppose, that the runs created function cannot be applied to individuals. But why not? If a double creates .8 runs for a team, and a hitter hits a double, why can’t we credit him with creating .8 of the team’s runs? All we are doing is assigning what we know are properly generated coefficients for the team to the player who actually delivered them. Or you can look at it, in the case of theoretical team RC, that we are isolating the player’s contribution by comparing team runs scored with him to team runs scored without him.
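
Here is a small sketch of that bookkeeping with a linear formula. Only the .8 for a double comes from the paragraph above; the other weights and the three player lines are made-up placeholders, so treat the totals as illustrative only. The point is simply that the credits handed to individuals add back up to exactly what the same coefficients give for the team.

```python
# Run values per event. Only the double (0.8) is taken from the text;
# the rest are hypothetical placeholders, not a published formula.
WEIGHTS = {"1B": 0.5, "2B": 0.8, "3B": 1.1, "HR": 1.4, "W": 0.3}

def linear_runs(events):
    """Sum of event counts times their run values."""
    return sum(WEIGHTS[event] * count for event, count in events.items())

# Hypothetical player lines.
players = {
    "Player A": {"1B": 100, "2B": 30, "3B": 3, "HR": 25, "W": 60},
    "Player B": {"1B": 120, "2B": 22, "3B": 6, "HR": 8, "W": 40},
    "Player C": {"1B": 80, "2B": 35, "3B": 1, "HR": 32, "W": 85},
}

# Team totals are just the players' events added together.
team = {
    event: sum(line[event] for line in players.values()) for event in WEIGHTS
}

credits = {name: linear_runs(line) for name, line in players.items()}
team_estimate = linear_runs(team)

for name, credit in credits.items():
    print(name, round(credit, 1))
print("Team", round(team_estimate, 1), "Sum of players", round(sum(credits.values()), 1))
```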

Furthermore, the individual runs created function and the team runs scored function are the same function. They have to be. Who causes the team to score runs, the tooth fairy? In the case of the voting situation which was said to be the ecological fallacy, you are artificially forming groups of people that don’t actually interact with each other. I can vote Republican, and you can vote Republican, but we’re not working together in that. You can vote Democrat and I can still vote Republican; our choices are independent. Then you make this group that voted Republican, and look at their income, and yes, you can reach misleading conclusions.

The point I’m trying to make is that voting is not a community-level function, and therefore it is wrong to attribute the community level data pattern to individuals. People vote as individuals, not as communities. But scoring runs is a team-level function. People create runs as teams, each contributing. If we use a different voting analogy, that of the electoral college, people cast electoral votes as states. And therefore we can break down how much of the electoral vote of Montana that each citizen was responsible for(one share of however many if they voted for the winning candidate, zero if they did not). And that’s what we are doing by looking at individual runs created.

I think the problem, and I don’t mean this to apply to all statisticians who dabble in sabermetrics, but to some, particularly those who don’t have a strong traditional sabermetric background to go along with their statistical knowledge, is that they tend to take all of the things they know can often happen in statistical practice and apply them to sabermetrics, without seeing whether the conditions are in place. In the same way, they will use statistical methods like regression when they are not necessary. If you are studying a phenomenon that you don’t have a good theory on, then regression can be a great tool. But if you are studying a baseball offense, you’re better off constructing a logical expression of the run scoring process like Base Runs or using the base/out table to construct Linear Weights. You don’t need a regression to ascertain the run values of events--baseball offenses are complex, but they are not nearly as complex as many of the other phenomena in the world.


Explanation of Ecological Fallacy


Ec. Fallacy claim applied to RC