Tuesday, May 25, 2010

Meaningless Math, Starring Secondary Average

This post contains a number of regressions and basically is a whole lot of mathematical goofing around with batting average and secondary average. This is exactly the type of "analysis" that I would rail against if presented by someone else and offered with fervent enthusiasm. However, I agree that it can be fun to just play around with numbers--if you recognize that is the extent of the exercise. The point is to explore relationships between these two statistics and runs scored, not to propose new metrics or argue that they are superior to pre-existing metrics...because clearly, they're not.

With the disclaimer out of the way, let me define terms for the rest of this post. Batting average (BA) is obviously just H/AB; secondary average (SEC) is in this case figured just on the basis of hitting statistics, and is (TB - H + W)/AB.
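As a minimal sketch of those two definitions in Python (the function names and the team totals are made up for illustration; they are not from the dataset used in this post):

```python
def batting_average(h, ab):
    """BA = H / AB."""
    return h / ab


def secondary_average(tb, h, w, ab):
    """SEC = (TB - H + W) / AB, using hitting statistics only."""
    return (tb - h + w) / ab


# Hypothetical team totals, invented for illustration only.
ab, h, tb, w = 5500, 1450, 2300, 550

ba = batting_average(h, ab)
sec = secondary_average(tb, h, w, ab)
print(f"BA = {ba:.3f}, SEC = {sec:.3f}")
```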

The data used is team seasons, 1990-2005 (excluding 1994). Throughout the post, I have tested formulas and relationships against the same data from which they were derived. This is certainly a no-no, but I'm not concerned about the accuracy of the equations so much as the relative relationships.

I'll be relating BA and SEC to runs per at bat (R/AB), runs per plate appearance (R/PA, with PA = AB + W), and runs per out (R/O, with outs = AB - H). How does each relate to runs scored in a linear equation? Let the prefix "a" denote "adjusted", meaning the ratio of a given statistic to the league average (in this case, I'll be treating the entire dataset as one "league"). The average BA is .265, SEC .250, R/PA .125, and R/O .187. The regression equations are:

aR/PA = 1.95*aBA - .95, RMSE = 44.3
aR/PA = .71*aSEC + .29, RMSE = 43.5
aR/O = 2.39*aBA - 1.38, RMSE = 49.8
aR/O = .85*aSEC + .15, RMSE = 50.8

This is not particularly helpful, but it does illustrate a couple of points that are worth keeping in mind. The first is that both of these measures are woefully incomplete. BA ignores walks and the extra bases on hits entirely, and SEC ignores singles altogether, so both miss important elements of offensive production.

Adjusted SEC has a positive intercept when used to estimate adjusted runs, which differentiates it from BA, OBA, SLG, and OPS. Relative to the league average, those rates all have a narrower percentage range than runs do. Secondary average has a wider range, and so a team's estimated relative runs deviate less from average than does its relative SEC.
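To make the intercept point concrete, here is a small sketch that plugs a hypothetical team into the two adjusted R/PA equations. The coefficients are copied from the regressions above; the function names and the choice of a team 10% above league average are arbitrary assumptions for illustration.

```python
# Coefficients copied from the adjusted R/PA regressions above.
def adj_rpa_from_ba(a_ba):
    return 1.95 * a_ba - 0.95


def adj_rpa_from_sec(a_sec):
    return 0.71 * a_sec + 0.29


# Hypothetical team sitting 10% above the league in the input stat.
relative_stat = 1.10
est_from_ba = adj_rpa_from_ba(relative_stat)
est_from_sec = adj_rpa_from_sec(relative_stat)

print(f"aBA  = {relative_stat:.2f} -> estimated aR/PA = {est_from_ba:.3f}")
print(f"aSEC = {relative_stat:.2f} -> estimated aR/PA = {est_from_sec:.3f}")
```

With a slope above one and a negative intercept, the BA equation pushes the estimate further from average than the input (about 1.195). With a slope below one and a positive intercept, the SEC equation pulls it back towards average (about 1.071).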

We can also regress BA and SEC against runs together. Here are three such equations, using different denominators for runs scored:

R/AB = .639(BA) + .305(SEC) - .108, RMSE = 24.9
R/PA = .604(BA) + .237(SEC) - .094, RMSE = 24.9
R/O = 1.13(BA) + .416(SEC) - .215, RMSE = 25.2
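As a quick sanity check, the sketch below plugs the dataset-average BA and SEC quoted earlier into the three equations. The function names are mine, and because the published coefficients are rounded, the results should only land close to the quoted average R/PA and R/O, not match them exactly.

```python
BA_AVG, SEC_AVG = 0.265, 0.250   # dataset averages quoted earlier
RPA_AVG, RO_AVG = 0.125, 0.187   # quoted average R/PA and R/O


def runs_per_ab(ba, sec):
    return 0.639 * ba + 0.305 * sec - 0.108


def runs_per_pa(ba, sec):
    return 0.604 * ba + 0.237 * sec - 0.094


def runs_per_out(ba, sec):
    return 1.13 * ba + 0.416 * sec - 0.215


est_rab = runs_per_ab(BA_AVG, SEC_AVG)
est_rpa = runs_per_pa(BA_AVG, SEC_AVG)
est_ro = runs_per_out(BA_AVG, SEC_AVG)

print(f"R/AB at average BA and SEC: {est_rab:.4f}")
print(f"R/PA at average BA and SEC: {est_rpa:.4f} (quoted average {RPA_AVG})")
print(f"R/O  at average BA and SEC: {est_ro:.4f} (quoted average {RO_AVG})")
```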

Let's look at the R/AB relationship, which is nice because if we multiply by at bats to estimate runs, the BA and SEC denominators will cancel out and we'll be left with a pure linear weights equation:

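est Runs = .639H + .305(EB + W) - .108AB ~= .53S + .84D + 1.14T + 1.45HR + .31W - .108(AB - H)

(Here EB is extra bases on hits, TB - H. Each hit is also an at bat, so the hit weights absorb the -.108, leaving that term to apply only to AB - H.)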
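The expansion into per-event weights can be sketched as follows. The coefficients come from the R/AB regression; the dictionary layout and the team totals are hypothetical, chosen only to show that the two forms of the equation give the same estimate.

```python
B_BA, B_SEC, B_0 = 0.639, 0.305, -0.108   # R/AB regression coefficients


def event_weights():
    """Per-event run values implied by est Runs = B_BA*H + B_SEC*(EB + W) + B_0*AB."""
    hit = B_BA + B_0  # every hit is also an at bat, so it absorbs the intercept term
    return {
        "S": hit,
        "D": hit + 1 * B_SEC,   # one extra base
        "T": hit + 2 * B_SEC,   # two extra bases
        "HR": hit + 3 * B_SEC,  # three extra bases
        "W": B_SEC,             # walks are not at bats, so no intercept term
        "out": B_0,             # AB - H
    }


# Hypothetical team totals, invented for illustration only.
team = {"S": 1000, "D": 280, "T": 30, "HR": 160, "W": 550, "out": 4000}

hits = team["S"] + team["D"] + team["T"] + team["HR"]
at_bats = hits + team["out"]
extra_bases = team["D"] + 2 * team["T"] + 3 * team["HR"]

direct = B_BA * hits + B_SEC * (extra_bases + team["W"]) + B_0 * at_bats
weights = event_weights()
via_weights = sum(weights[event] * count for event, count in team.items())

for event, value in weights.items():
    print(f"{event}: {value:.3f}")
print(f"direct = {direct:.2f}, via event weights = {via_weights:.2f}")
```

The weights here are built from the rounded coefficients, so the third decimal may differ slightly from the values in the equation above.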

This equation is not that bad; it's a little high on all hits, but one could do a lot worse. Looking at the equation, you can see that it is essentially 2*BA + SEC times a constant, minus .108. (Actually, .639/.305 = 2.1)

Statements like "Stat X is twice as important as Stat Y" are always dangerous, because it's not exactly clear what that means. Does it mean that Stat X has twice the correlation with runs scored? Twice the r^2? Half the RMSE? Gets a weight of two (as BA does here) when combined with Stat Y to predict runs? Gets a weight of two when adjusted, then combined with adjusted Stat Y to predict runs? One needs look no further than the confusion over the quote attributed to Paul DePodesta in Moneyball on the relative value of OBA and SLG for an example of this.

However, if one goes with "Gets a weight of two when combined with Stat Y to predict runs" as the definition of "twice as important", then with respect to estimating R/AB, BA is twice as important as SEC. Bill James wrote that "batting average is roughly twice as important as secondary average", so from this perspective, his statement was accurate (*). It is interesting to note that Clay Davenport used this statement to create "combined average", (2*BA + SEC)/3, which he eventually developed into his signature statistic, Equivalent Average.

I'm not going to get into whether some simple BA/SEC combination is better or worse than OPS and its derivatives. I don't think either family of metrics should be widely used in combination, because the applications in which you'd combine the two are exactly the ones in which you should be using wOBA, or EqA, or wRC+ (if you're sticking to the name-brands). However, the one nice thing about BA and SEC is that when you break them down, everything has the same denominator (AB).

It is tempting with either family of metrics to break them down into the three major components of hitting: base hits, walks, and extra bases on hits. BA and SEC allow one to do so while keeping at bats as the denominator, with the three components being BA, Isolated Power, and walks/at bat. Walks/at bat is not optimal, since the denominator does not include the numerator quantity, but it is directly relatable to walks/PA (assuming HB and sacrifices are excluded).

The same cannot be said for OPS, which has OBA-BA as its walk component. While this metric is sometimes used and called "isolated discipline" or something similar, mathematically it is equal to (walks/PA)*(1 - BA), which is not particularly logical for use as a measure of walk frequency.
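A short numeric check of that identity, with PA taken as AB + W as elsewhere in this post. The function name and the two hitters are hypothetical; they are set up to have the same walk rate but different batting averages.

```python
def walk_measures(h, ab, w):
    """Return (OBA - BA, W/PA, (W/PA)*(1 - BA)) with PA = AB + W."""
    ba = h / ab
    oba = (h + w) / (ab + w)
    w_per_pa = w / (ab + w)
    return oba - ba, w_per_pa, w_per_pa * (1 - ba)


# Two hypothetical hitters: identical walk rates, different batting averages.
low_ba = walk_measures(h=125, ab=500, w=50)    # .250 hitter
high_ba = walk_measures(h=175, ab=500, w=50)   # .350 hitter

for label, (oba_minus_ba, w_per_pa, product) in (("low BA", low_ba), ("high BA", high_ba)):
    print(f"{label}: OBA - BA = {oba_minus_ba:.4f}, "
          f"W/PA = {w_per_pa:.4f}, (W/PA)*(1 - BA) = {product:.4f}")
```

Both hitters walk at the same rate per plate appearance, yet the higher-average hitter shows the smaller OBA - BA, which is the oddity described above.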

Tying that all together, in case you ever want to figure SEC from less than complete data, here is the math tying SEC to BA, OBA, and SLG (again, leaving stolen base attempts out, and also leaving HB and sacrifices out of OBA). You can see that SEC can be rewritten as (TB - H)/AB + W/AB, which is equal to isolated power (SLG - BA) plus walks per at bat. Walks per at bat is equal to (OBA - BA)/(1 - OBA); to get walks per PA, use (OBA - BA)/(1 - BA). So Secondary Average can be figured directly from BA, OBA, and SLG as:

SEC = SLG - BA + (OBA - BA)/(1 - OBA)
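Here is a rough check of that formula against the direct definition, under the same assumptions (OBA figured as (H + W)/(AB + W), no HB or sacrifices). The function name and team totals are made up for illustration.

```python
def sec_from_rates(ba, oba, slg):
    """SEC = isolated power + walks per at bat, from the three rate stats."""
    isolated_power = slg - ba
    walks_per_ab = (oba - ba) / (1 - oba)
    return isolated_power + walks_per_ab


# Hypothetical team totals, invented for illustration only.
ab, h, tb, w = 5500, 1450, 2300, 550

ba = h / ab
oba = (h + w) / (ab + w)   # HB and sacrifices left out, as in the text
slg = tb / ab

sec_direct = (tb - h + w) / ab
sec_rates = sec_from_rates(ba, oba, slg)
print(f"direct SEC = {sec_direct:.4f}, from BA/OBA/SLG = {sec_rates:.4f}")
```

With real published rates the two will differ a little, both from rounding to three decimals and because official OBA includes HB and sacrifice flies.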

(*) I have argued before that when evaluating a rate stat that will be used on its own to measure total production, with no alteration of the scale or conversion into runs, the important relationship is that of the rate stat with runs/out. The argument goes that since runs/out is our standard choice as an overall offensive rate (certainly correct for teams, and often used for convenience with players as well), any substitute rate of overall offensive productivity should be a stand-in for it. The 2:1 BA/SEC weighting implied by the regression holds only for the R/AB relationship; the R/PA and R/O regressions are tilted more heavily towards BA.
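For reference, the BA-to-SEC weight ratio in each of the three combined regressions can be figured from the published coefficients like so (the dictionary is just a convenient way to hold them):

```python
# (BA coefficient, SEC coefficient) from the three combined regressions.
coefficients = {
    "R/AB": (0.639, 0.305),
    "R/PA": (0.604, 0.237),
    "R/O": (1.13, 0.416),
}

ratios = {name: b_ba / b_sec for name, (b_ba, b_sec) in coefficients.items()}

for name, ratio in ratios.items():
    print(f"{name}: BA weight / SEC weight = {ratio:.2f}")
```

That works out to roughly 2.1 for R/AB, 2.5 for R/PA, and 2.7 for R/O.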
