Sunday, May 30, 2010

Tuesday, May 25, 2010

Meaningless Math, Starring Secondary Average

This post contains a number of regressions and basically is a whole lot of mathematical goofing around with batting average and secondary average. This is exactly the type of "analysis" that I would rail against if presented by someone else and offered with fervent enthusiasm. However, I agree that it can be fun to just play around with numbers--if you recognize that is the extent of the exercise. The point is to explore relationships between these two statistics and runs scored, not to propose new metrics or argue that they are superior to pre-existing metrics...because clearly, they're not.

With the disclaimer out of the way, let me define terms for the rest of this post. Batting average (BA) is obviously just H/AB; secondary average (SEC) is in this case figured just on the basis of hitting statistics, and is (TB - H + W)/AB.

The data used is team seasons, 1990-2005 (excluding 1994). Throughout the post, I have tested formulas and relationships against the same data from which they were derived. This is certainly a no-no, but I'm not concerned about the accuracy of the equations so much as the relative relationships.

I'll be relating BA and SEC to runs per at bat (R/AB), plate appearance (AB + W, R/PA), and outs (AB - H, R/O). How does each relate to runs scored in a linear equation? Let the prefix "a" denote "adjusted", or the ratio of a given statistic to the league average (in this case, I'll be treating the entire dataset as one "league"). The average BA is .265, SEC .250, R/PA .125, and R/O .187. The regression equations are:

aR/PA = 1.95*aBA - .95, RMSE = 44.3
aR/PA = .71*aSEC + .29, RMSE = 43.5
aR/O = 2.39*aBA - 1.38, RMSE = 49.8
aR/O = .85*aSEC + .15, RMSE = 50.8

This is not particularly helpful, but it does illustrate a couple points that are worth keeping in mind. The first is that both of these measures are woefully incomplete. BA, by ignoring the extra base contributions of hits and walks entirely, and SEC, by ignoring singles altogether, both miss important elements of offensive production.

Adjusted SEC has a positive intercept when used to estimate adjusted runs, which differentiates it from BA, OBA, SLG, and OPS. Those rates all have a narrower percentage range (when compared to league average) than runs relative to the league average. Secondary average has a wider range, and so the estimated relative runs for a team deviate less from average than does its relative SEC.
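To see the intercept point with numbers, here is a minimal Python sketch that plugs a made-up team into the two adjusted R/PA equations above. The function names and the "10% above average" team are illustrative assumptions; only the coefficients and league averages come from the post.

```python
# League averages for the 1990-2005 team dataset, as quoted in the post.
LG_BA = 0.265
LG_SEC = 0.250


def adj_r_pa_from_ba(ba):
    """aR/PA = 1.95*aBA - .95, where aBA = BA / league BA."""
    return 1.95 * (ba / LG_BA) - 0.95


def adj_r_pa_from_sec(sec):
    """aR/PA = .71*aSEC + .29, where aSEC = SEC / league SEC."""
    return 0.71 * (sec / LG_SEC) + 0.29


# Hypothetical team that is 10% above league average in each stat.
from_ba = adj_r_pa_from_ba(1.10 * LG_BA)
from_sec = adj_r_pa_from_sec(1.10 * LG_SEC)

print(f"aBA  = 1.10 -> estimated aR/PA = {from_ba:.3f}")
print(f"aSEC = 1.10 -> estimated aR/PA = {from_sec:.3f}")
```

Under these equations, a 10% edge in BA maps to an estimate of about 1.195 (runs stretch further from average than BA does), while a 10% edge in SEC maps to about 1.071 (runs are pulled back toward average).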

We can also regress BA and SEC against runs together. Here are three such equations, using different denominators for runs scored:

R/AB = .639(BA) + .305(SEC) - .108, RMSE = 24.9
R/PA = .604(BA) + .237(SEC) - .094, RMSE = 24.9
R/O = 1.13(BA) + .416(SEC) - .215, RMSE = 25.2

Let's look at the R/AB relationship, which is nice because if we multiply by at bats to estimate runs, the BA and SEC denominators will cancel out and we'll be left with a pure linear weights equation (EB here is extra bases on hits, TB - H):

est Runs = .639H + .305(EB + W) - .108AB ~= .53S + .84D + 1.14T + 1.45HR + .31W - .108(AB-H)

This equation is not that bad; it's a little high on all hits, but one could do a lot worse. Looking at the equation, you can see that it is essentially 2*BA + SEC times a constant, minus .108. (Actually, .639/.305 = 2.1)
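The step from the regression to the linear weights can be sketched in Python. The trick is to split AB into H + (AB - H), so each hit carries .639 - .108 plus .305 per extra base. The team totals below are invented purely to check that both forms of the equation agree; the function names are my own.

```python
# Coefficients from the R/AB regression quoted above.
BA_COEF = 0.639
SEC_COEF = 0.305
INTERCEPT = -0.108


def implied_linear_weights():
    """Expand .639H + .305(EB + W) - .108AB into per-event weights."""
    hit_base = BA_COEF + INTERCEPT  # every hit is also an at bat
    return {
        "S": hit_base,                  # no extra bases
        "D": hit_base + 1 * SEC_COEF,   # one extra base
        "T": hit_base + 2 * SEC_COEF,   # two extra bases
        "HR": hit_base + 3 * SEC_COEF,  # three extra bases
        "W": SEC_COEF,
        "OUT": INTERCEPT,               # applied to AB - H
    }


def est_runs_from_rates(ab, h, tb, w):
    """Runs estimate in the original regression form."""
    return BA_COEF * h + SEC_COEF * (tb - h + w) + INTERCEPT * ab


def est_runs_from_weights(ab, s, d, t, hr, w):
    """Runs estimate using the expanded per-event weights."""
    lw = implied_linear_weights()
    outs = ab - (s + d + t + hr)
    return (lw["S"] * s + lw["D"] * d + lw["T"] * t + lw["HR"] * hr
            + lw["W"] * w + lw["OUT"] * outs)


# Hypothetical team totals (not from the dataset).
AB, S, D, T, HR, W = 5500, 980, 280, 30, 160, 530
H = S + D + T + HR
TB = S + 2 * D + 3 * T + 4 * HR

weights = implied_linear_weights()
runs_a = est_runs_from_rates(AB, H, TB, W)
runs_b = est_runs_from_weights(AB, S, D, T, HR, W)
print({k: round(v, 3) for k, v in weights.items()})
print(round(runs_a, 1), round(runs_b, 1))
```

Both forms give the same estimate for any input, since the second is just an algebraic rearrangement of the first.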

Statements like "Stat X is twice as important as Stat Y" are always dangerous, because it's not exactly clear what that means. Does it mean that Stat X has twice the correlation with runs scored? Twice the r^2? Half the RMSE? Gets a weight of two (as BA does here) when combined with Stat Y to predict runs? Gets a weight of two when adjusted, then combined with adjusted Stat Y to predict runs? One needs look no further than the confusion over the quote attributed to Paul DePodesta in Moneyball on the relative value of OBA and SLG for an example of this.

However, if one goes with "Gets a weight of two when combined with Stat Y to predict runs" as the definition of "twice as important", then with respect to estimating R/AB, BA is twice as important as SEC. James wrote that "batting average is roughly twice as important as secondary average", so from this perspective, his statement was accurate (*). It is interesting to note that Clay Davenport used this statement to create "combined average", (2*BA + SEC)/3, which he eventually developed into his signature statistic, Equivalent Average.

I'm not going to get into whether some simple BA/SEC combination is better or worse than OPS and its derivatives. I don't think either family of metrics should be used widely in combination, because it's for those applications in which you'd use a combination of the two that you should be using wOBA, or EqA, or wRC+ (if you're sticking to the name-brands). However, the one nice thing about BA and SEC is that when you break them down, everything has the same denominator (AB).

It is tempting with either family of metrics to break them down into the three major components of hitting: base hits, walks, and extra bases on hits. BA and SEC allow one to do so while keeping at bats as the denominator, with the three components being BA, Isolated Power, and walks/at bat. While walks/at bat is not optimal, since the denominator does not include the numerator quantity, it is directly relatable to walks/PA (assuming HB and sacrifices are excluded).

The same cannot be said for OPS, which has OBA-BA as its walk component. While this metric is sometimes used and called "isolated discipline" or something similar, mathematically it is equal to (walks/PA)*(1 - BA), which is not particularly logical for use as a measure of walk frequency.
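A small Python sketch shows why OBA - BA is an awkward walk measure. The two hitters are hypothetical, chosen to have identical walks/PA but different batting averages, and OBA is figured as (H + W)/(AB + W), with HB and sacrifices left out as in the rest of this post.

```python
def isolated_discipline(ab, h, w):
    """OBA - BA, with OBA = (H + W)/(AB + W)."""
    return (h + w) / (ab + w) - h / ab


def walk_rate_times_out_rate(ab, h, w):
    """(W/PA) * (1 - BA), the equivalent form noted above."""
    return (w / (ab + w)) * (1 - h / ab)


# Two hypothetical hitters: same 540 AB and 60 W (so W/PA = .100),
# but one hits .250 and the other about .320.
low_ba = isolated_discipline(540, 135, 60)
high_ba = isolated_discipline(540, 173, 60)

print(f"low-BA hitter:  OBA - BA = {low_ba:.3f}")
print(f"high-BA hitter: OBA - BA = {high_ba:.3f}")
```

The hitter with the higher batting average shows the smaller OBA - BA even though both walk at exactly the same rate, which is the (1 - BA) factor at work.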

Tying that all together, in case you ever want to figure SEC from less than complete data, here is the math tying SEC to BA, OBA, and SLG (again, leaving stolen base attempts out, and also leaving HB and sacrifices out of OBA). You can see that SEC can be rewritten as (TB - H)/AB + W/AB, which is equal to isolated power (SLG - BA) plus walks per at bat. Walks per at bat is equal to (OBA - BA)/(1 - OBA); to get walks per PA, use (OBA - BA)/(1 - BA). So Secondary Average can be figured directly from BA, OBA, and SLG as:

SEC = SLG - BA + (OBA - BA)/(1 - OBA)
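As a sanity check, here is a short Python sketch comparing the formula against SEC figured directly from counting stats. The stat line is hypothetical and the function names are my own; OBA is again (H + W)/(AB + W).

```python
def sec_from_counts(ab, h, tb, w):
    """SEC figured directly: (TB - H + W)/AB."""
    return (tb - h + w) / ab


def sec_from_slash(ba, oba, slg):
    """SEC from the slash line: SLG - BA + (OBA - BA)/(1 - OBA)."""
    return (slg - ba) + (oba - ba) / (1 - oba)


# Hypothetical hitter: 500 AB, 140 H, 240 TB, 60 W.
ab, h, tb, w = 500, 140, 240, 60
ba = h / ab
oba = (h + w) / (ab + w)
slg = tb / ab

direct = sec_from_counts(ab, h, tb, w)
derived = sec_from_slash(ba, oba, slg)
print(f"direct = {direct:.3f}, from BA/OBA/SLG = {derived:.3f}")
```

The two agree exactly (up to floating point) as long as OBA really is built from just hits, walks, and at bats. With published OBA figures that include HB and sacrifice flies, the result is only an approximation.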

(*) I have argued before that when evaluating a rate stat that will be used on its own to measure total production, with no alteration of the scale or conversion into runs, the important relationship is that of the rate stat with runs/out. The argument goes that since runs/out is our standard choice as an overall offensive rate (certainly correct for teams, and we often use it for convenience with players as well), any substitute rate of overall offensive productivity should be a stand-in for it. The 2:1 BA/SEC weighting implied by the regression is only for the R/AB relationship; the R/PA and R/O regressions are tilted more heavily towards BA.

Thursday, May 20, 2010

The All-LOST Surname All-Stars

C: Earl Smith (Libby)
1B: Jamie Burke (Juliet)
2B: Tony Fernandez (Nikki)
3B: Jimmy Austin (Kate, cheating as her name is Austen)
SS: Jose Reyes (Hurley)
OF: Rickey Henderson (Rose)
OF: Andre Dawson (Michael)
OF: Buddy Lewis (Charlotte)

P: Whitey Ford (Sawyer)
P: Spud Chandler (Cindy)
P: Graeme Lloyd (Walt)
P: Buddy Carlyle (Boone)
P: David Cortes (Ana-Lucia)
P: Tom Hume (Desmond)
P: Wes Littleton (Claire)
P: Johnny Rutherford (Shannon)
P: Keith Shepherd (Jack, cheating as his name is Shephard)

Monday, May 17, 2010


* One of my weaknesses as a general baseball fan is the difficulty I have in divorcing my outlook towards a player or team from the coverage they get. The most common manifestation is when a player is significantly "overrated" by mainstream media and fans.

I put the scare quotes around overrated because it can be a dangerous word--in order to know whether someone is overrated, you first have to agree on a rating scheme and then procure two credible sets of ratings to compare. There are few opportunities to do so, as it's rare for there to be any formal compilation of opinions on players. The BBWAA awards and the Hall of Fame vote are two exceptions, but they deal only with a select group of players and solely represent the opinions of writers, not the public at large.

So there is a great deal of imprecision inherent in labeling a player or team as overrated, even aside from the uncertainty present in one's own ratings. Thus, the determination that a player is overrated usually takes the form of a gut reaction.

In any event, I'd suggest that Ryan Howard is an overrated player. Ryan Howard is clearly a good player, but he also happens to be one who has had the fortune of playing on a good team and excelling in categories that mainstream folks still value disproportionately to sabermetricians. Of course, this is not Ryan Howard's fault. Ryan Howard is not responsible for the RBI fetish, Ryan Howard is not responsible for the tendency to lionize players as winners, and of course he has no control over what anyone writes about him. He's not responsible for the fact that his propensity to strike out is ignored while it becomes a major point of discussion for others.

As such, it is irrational to hold the media and public perception of Ryan Howard against him. But this is where I struggle. I find it hard to read Ryan Howard described as "the Babe Ruth of his generation at the plate" and not resent him for it.

Another example is Derek Jeter. I don't feel bad about disliking Jeter; I'd dislike him anyway due to his college sports sympathies. He's clearly a great player, but he's been built up into something more in certain quarters--the living embodiment of everything that a baseball player is supposed to be, as Pete Rose and Cal Ripken have been in the past.

Intellectually, I should be able to get over it, and not let silly hyperbole and shoddy analysis influence who I root for and my feeling towards players. It's a struggle.

* I have never played baseball on a competitive level, let alone professionally, so I am not really qualified to talk about unwritten rules or anything else that takes place between the lines. That's never stopped anyone before, though, and I have some thoughts on the A-Rod/Braden incident that aren't particularly well-articulated (or novel or interesting or...) but that I feel compelled to write up anyway.

Baseball, like many other institutions, has built up layers of custom and tradition over time. Some of it made sense when it was established; some of it still makes sense today. However, much of it is bizarre, contradictory, and silly. I have always applauded people who flout the silly stuff, who don't allow themselves to become slaves to a code of conduct that doesn't even exist.

Rodriguez has consistently demonstrated that he has no regard for the imaginary rules. He has broken them at least three times that I can recall off the top of my head. The common thread between the two previous instances is that they both demonstrated a desire to win, method be damned. Yelling as a Blue Jay waited to catch the popup speaks for itself, but more interesting was his playoff glove slap.

The play was hopeless, so why not take a slap at the glove and see if you can get away with it? This kind of behavior is tolerated, even celebrated in other sports. You're an offensive tackle about to have Reggie White blow past? Take him down. A cornerback beaten deep? Same thing. Someone is driving for an easy layup or dunk? Give them a hard foul, make them earn it at the line. Not doing these things, giving up and allowing the inevitable to occur, is the course of action that is considered deviant.

Some people taking the anti-ARod side have pointed out that because he is one of the greats of the game, he has the ability to push things further than lesser players. This is undoubtedly true. I contend that it is also true that since A-Rod is one of the most despised players by the media, he gets a lot more criticism for anything he does than a player who acts similarly but is a certified red-ass. When Pete Rose flattened Ray Fosse in an All-Star game, he had scores of defenders. "That's the right way to play the game, exhibition or not." Presumably there's some sort of unwritten rule against that sort of thing, but Pete Rose just wanted to win at all costs. That's admirable, don't you see?

Added later: Of course, Braden then had the sense of timing to go off and join Charlie Robertson, Len Barker, and Mike Witt as the most mediocre spinners of perfect games, and thus cement his status as the most insufferable player in baseball. Braden's perfect game offered an opportunity for the world's Red-Ass Admiration Society to come together as one and declare vindication. The silliness of connecting the two situations is self-evident, but that didn't stop the less logical members of the RAAS.

* Baseball writers must be getting a little bored with their usual obsessions, like steroids and contraction. Early this season, the premise that three true outcome baseball is boring has come to the forefront.

I will admit my bias upfront: I quite enjoy three true outcome baseball. I do not agree with the notion that strikeouts, walks, and home runs are boring. Certainly, they can be, and I don't think that the extreme home run levels of, say, 1996 are particularly enjoyable, but in general I think a home run is just about the most exciting play you can imagine. The threat of being able to score at any time, regardless of base/out state, spices up the game in my opinion (I am going to stop prefacing everything with IMO henceforth because obviously it's my opinion). Some people bemoan the fact that middle infielders are now home run threats, but I think the possibility of an instant score up and down the lineup is a good thing.

With regard to strikeouts and walks, I also maintain that these are among the most interesting outcomes. Are four-pitch walks that result more from failing control than the batter/pitcher duel boring? Sure. But I think that a walk drawn in a deep count, with pitches fouled off and/or close pitches taken is actually quite exciting.

That's not to say that balls in play are uninteresting, but I think that the large percentage of them that are routine outs are in fact less interesting than a strikeout. I don't think a routine grounder to short or a fly to medium right is anything to get particularly excited about.

Of course, I don't expect that every baseball fan will share my opinion, and that's fine. These types of discussions get obnoxious when fans start deriding what they don't like as not being "real baseball" (whatever the hell that means), or glorifying the good old days not because they liked the style of play better but because the players were better and the world has gone to hell in a handbasket. This type of rhetoric is employed by a minority, certainly, but it's out there.

The other obnoxious thing is when people try to ascribe motives to others' opinions on how the game should be played, like saying that everyone wants baseball to be the way it was when they were ten. There's a lot of truth in that kind of statement, but it's silly to assume that one couldn't come to the same preference through genuine open-mindedness.

All of this is but a skirmish compared to the rhetorical war that erupts over the DH rule. To defuse any idea that I am presenting myself as above the fray here, I have to admit that I do like to throw "Neanderthal League" references around.

* I am not ashamed to admit that I am something of a LOST fanboy, and as such I've been very excited by some people referring to Justin Smoak as the "Smoak Monster". So of course I had to go and try to think up some more LOST-inspired nicknames in honor of the series finale this Sunday.

Unfortunately, I am not creative, particularly terrible at puns, and generally unable to come up with anything not cringe-worthy. The only one that I like is calling Lou Piniella "Dr. Arzt". Piniella is certainly a much more self-confident and accomplished person than Leslie, so the parallel is largely based on Arzt's demise.

In the Season One finale, "Exodus", Arzt tagged along with several of the heroes and Rousseau, the crazy French long-time Island resident, on a trip to Black Rock to get dynamite so that they could hide from the Others. The Black Rock is an old slaving ship that happens to be parked several miles inland, because in 1867 it was caught up in a massive storm and brought in on a massive wave that took out the four-toed statue of Taweret, in which Island protector Jacob lived and...yeah, if you don't watch the show, you're probably wondering what I'm smoking.

Arzt insisted on handling the dynamite due to its instability after sitting out in the jungle for over a century. He lectured the other survivors about the history of dynamite and began gesturing with his arms and promptly detonated himself.

And that is why I connect him with Lou Piniella. He loves (used to love?) to go out and put on a show when arguing a call, waving his arms, kicking his hat, throwing bases, etc. He also has a newfound tendency to self-destruct by putting his best starting pitcher in a setup role. Lou Piniella--Dr. Arzt.

Monday, May 03, 2010

Why I Like Secondary Average

Over the summer, there was a thread on Baseball Think Factory about Baseball-Reference and other statistical sites. A poster said that one category missing from the seemingly endless array offered by B-R was secondary average, which led to this reply by Colin Wyers:

Baseball Reference has secondary average.

That said - does anyone need secondary average? Really?

I assume that if you are reading this blog, you are familiar with Colin's work--if not, you should be. I'm certainly not trying to pick on Colin here. Besides, he's right--no one *needs* secondary average. There are two big reasons why not, aside from the actual weighting of events in the metric itself:

1) it's not an overall measure of offensive production--in fact, it is sort of nebulously defined as "stuff that batting average doesn't account for".

2) it's not expressed in a fundamental baseball unit like runs or wins--it's total bases beyond first, plus walks, plus steals per at bat--hardly a unit that cuts to the essence of the game. In the past I've argued that slugging average doesn't have a fundamental baseball foundation, but at least what it (crudely) measures is clear--rate of bases gained by the batter on hits per at bat, and it can be used directly as an input in a crude dynamic model of run scoring (basic Runs Created). Secondary average's logical basis doesn't even approach that level.

All that being said, I personally like secondary average, and I use it occasionally on this blog. Admittedly it doesn't really have any analytical value, and in a world in which everyone agreed with the central tenets of sabermetrics, there wouldn't really be any need for it.

However, that's not the case, at least not yet. There are still many people who start their evaluation of an offensive player with his batting average. It might be better if they'd start with OBA, or start with OPS, or start with any number of other metrics, but as long as BA is widely used, secondary average has some utility as a way of considering everything that doesn't go into BA.

Bill James introduced secondary average (I'm going to start using the abbreviation SEC interchangeably) in his Yankees team essay in the 1986 Baseball Abstract. James defined SEC = (TB - H + W + SB)/AB, although some later versions subtract CS, and sometimes SB are excluded altogether. James' discussion ran for several pages, and I have lifted some quotes and paraphrased some of his points, as many of his reasons for liking secondary average are the same as mine.
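For reference, the variants of the formula can be sketched in a few lines of Python. The function name, the `include_sb` flag, and the stat line are all hypothetical; the `cs` argument covers the variant that nets caught stealing out of the numerator, which is my reading of how the later versions treat CS.

```python
def secondary_average(ab, h, tb, w, sb=0, cs=0, include_sb=True):
    """SEC = (TB - H + W + SB - CS)/AB.

    With cs=0 this is James' 1986 definition; with include_sb=False
    the running game is dropped and it reduces to (TB - H + W)/AB.
    """
    numerator = tb - h + w
    if include_sb:
        numerator += sb - cs
    return numerator / ab


# Hypothetical player: 550 AB, 150 H, 250 TB, 70 W, 20 SB, 5 CS.
james_1986 = secondary_average(550, 150, 250, 70, sb=20)
net_of_cs = secondary_average(550, 150, 250, 70, sb=20, cs=5)
batting_only = secondary_average(550, 150, 250, 70, sb=20, cs=5,
                                 include_sb=False)

print(f"{james_1986:.3f} {net_of_cs:.3f} {batting_only:.3f}")
```

For a player who runs a lot the three versions can differ by a few dozen points, so it is worth checking which one a given source is reporting.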

What was the purpose of the new stat? "[It] focuses on the major areas of offensive productivity which are not reflected in the player's batting average...It does not constitute new knowledge; rather, it is a new way of expressing a set of values which have already been accepted."

James explains that he had been endeavoring for several years to express a player's non-BA offensive contributions in a straightforward, easily understandable manner. (Keep in mind that this was even more of a challenge in the mid-80s than it is now, when the importance of OBA and SLG and other sabermetric principles about offense are in the mainstream). He recounts a number of his previous efforts:

1) He tried calling players with significant non-BA contributions "percentage players", but it focused too much on the numbers rather than the player and was already in use to mean something else.

2) He referred to a player having good or bad "peripherals", but thought that the word had some negative connotations. (However, it is now widely used in sabermetrics in regard to pitching statistics.)

3) He tried calling players "Ted Williams" or "Joe Morgan" types, with players of the Joe Morgan class having less power but more speed. James decided that this language invoked specific comparisons, usually inappropriate, to two players and could be misinterpreted as referring to attributes other than offensive shape (including body shape).

4) He referred to players as AA or CC (or BC, etc.) type players, where AA players hit for power and drew walks, CC players did neither, and so forth. This was hard to explain to people not already familiar with his system and James says that even he had trouble keeping straight whether the letters represented (power, walks) or (walks, power).

James then stumbled upon secondary average, and obviously decided to introduce it in the Abstract, making several points about its nature:

1) Average SEC is similar to average BA, on the macro level. For example, in 2009 the AL had a .280 SEC (figured with SB; without, it was .260) and a .267 BA. Before the offensive explosion, the inclusion of steals served to bring the figures closer, generally, on the league level.

2) It makes use of only important, easily available statistical categories--no sacrifices and the like. This was a much bigger deal from the perspective of the mid-80s than it is today, I suppose.

3) "Secondary average is more important (a better indicator of hitting ability) than is batting average" because of the larger spread in individual figures. "One point of secondary average is clearly not as valuable as one point of batting average."

4) SEC is a "collection category" like Total Average--it adds together like things but doesn't attempt to quantify the value differences between them.

In summation, I agree with Mr. James that secondary average is useful to give a quick glance at the size of a player's offensive contribution that does not come from batting average. In 2009, Willy Taveras hit .240, Yuniesky Betancourt .245, Adam Everett .238, Dan Uggla .243, David Ortiz .238, and Jack Cust .240. As you well know, though, the latter three were much more productive offensive players. Secondary average quickly and simply captures their contributions that did not go into BA. Their SECs were .379, .360, and .359 respectively. The first three players came in at .089, .151, and .151. Their batting averages are indistinguishable, but secondary average shows that there was a huge productivity gap between the two groups.

So the usefulness of SEC as a gauge is largely tied to the degree to which batting average is used as the starting point for player comparisons. If you are a sabermetrician, or even a user of OPS, then you don't need secondary average; in all likelihood you hardly even look at batting average, which along with the meaningless units of SEC makes an opinion on the statistic like the one I quoted from Colin perfectly understandable. However, if you want a relatively easy tool you might be able to explain to an uninitiated fan, or if, like me, you have a personal bias for players who draw walks and hit for power, then secondary average is a fun, junk-ish stat to have at your disposal.