Monday, August 27, 2012

Sabermetric Generations

Allow me the indulgence of going meta-sabermetric here. I don't really like doing that, as this is the point at which you are writing about sabermetrics itself rather than baseball, and baseball is a heckuva lot more interesting than sabermetrics. However, I think there are some things about the field that those of us who fancy ourselves as sabermetricians should consider. This post is a half-developed missive on some of those things.

My basic premise here is that sabermetric people can be divided into three generations--not perfectly, of course, but as a general classification. I say "people" because I don't want to make it about who is or is not a "sabermetrician" (Do you have to publish your own run estimator to be a sabermetrician? Write a blog? Do any research at all?), but just about people who would consider themselves to be either practitioners or consumers of sabermetric research, or both.

These three generations are not defined strictly by age, but rather by when you came of age as a saberperson (or how, but the when and the how elements are very closely related). My three groups are:

1. Pioneers--This is by far the most restrictive group, as it only includes those who actually did pioneering sabermetric research (whether they called it by that name or not). Earnshaw Cook, George Lindsey, Pete Palmer, Bill James, and the like are the pioneers.

2. Second Wave--These are the folks who came to sabermetrics largely through the work of the pioneers--reading the Baseball Abstract or The Hidden Game or various SABR publications or The Diamond Appraised, and the like. They may or may not have gone on to become researchers themselves; they may just be consumers of research. It is also possible that their own inquisitiveness led them to sabermetrics without a firm push from Bill James or another pioneer, but they still came onto the scene after the work of the pioneers had been published. Many, many people fall into this group, and even listing a few would be foolish. I consider myself in this group although I also share a few traits with the third group, as I explained before.

3. Internet generation--These are people who have come to sabermetrics in the last 10-15 years and may have done so without ever reading the work of the pioneers. Their interest in sabermetrics is young enough to have been fueled by reading the work of second wavers (or of course their own inquisitiveness). A typical path to sabermetrics for a member of the Internet generation would have been to read Rob Neyer. From there, they sought out BP or Bill James.

Before I use these classifications to make a point, I need to issue a couple of disclaimers. The first is that I am not an evangelist for sabermetrics. I don't go to the airport and hand out flowers, or go to people's doors and hand them tracts. I don't really care if you are interested in sabermetrics or not, and I don't tailor my writing to appeal to folks who are on the fence.

So when I express my concern about something below, it's not borne out of any fear of what overzealous internet posters will do to the reputation of sabermetrics or anything like that; it is simply out of an ordinary desire for intelligent and factual discourse.

The second disclaimer is that this is not a "get off my lawn" post. As I stated in the piece linked above, I straddle the fence between the second wave and the internet generation, and while I might wish to place myself in the former group, you could make a reasonable case that I am in fact a member of the latter. No group is inherently better or worse than any other; this post is about certain negative traits of some members of the internet generation, but there are many positive things that can be said about that generation as well.

The third disclaimer is that this whole matter of putting saberpeople into groups and then describing those groups is obviously dangerous for the same reason that forming any sort of artificial groups of people is. So it should go without saying that I am not claiming that all members of a generation share certain characteristics or behave exactly the same way.

My concern is about the fact that with the wealth of information available today, particularly through sites like Baseball-Reference, Fangraphs, and StatCorner, it has become quite possible for members of the internet generation of saberpeople to cite statistics without really understanding them at all. Of course, second wavers also had this ability, but it wasn't so instantaneous. You had to wait for new books to be published in the spring or you had to go through the trouble of figuring statistics yourself.

Of course, there are many members of the internet generation who do fantastic research, develop their own statistics, or figure stats themselves. They are not who I am talking about. I am talking about the subset of folks who don't do their own research, don't endeavor to truly understand what the numbers mean, and yet still talk about them authoritatively.

With sabermetrics prominent on the internet baseball scene, it is much easier to learn about the field. This is on the whole a very welcome improvement, but it is also easier for people to get indoctrinated into sabermetric principles without fully understanding them. There is a sabermetric brand of conventional wisdom which can be just as misleading as the conventional wisdom of the traditionalists when it is wielded by an individual who has not done his own legwork, but has simply read or been told it and believed it to be true.

The ubiquity of sabermetric ideas in online discussions of baseball makes it quite possible for members of the internet generation to be introduced to sabermetrics at almost the same moment they become serious baseball fans. This pretty much happened to me, as recounted in the earlier linked post, although it was primarily through the printed works of the pioneers rather than through the internet. That being the case, I can't criticize this path of discovery. However, I do think that there might be a critical thinking advantage to having first accepted the conventional wisdom, gradually looked at it skeptically, and then had those questions reinforced by the discovery of sabermetrics. Today it is quite feasible for young baseball fans to skip the conventional wisdom altogether and jump right into sabermetric ideas.

A specific example of the kind of thing I'm talking about is the notion that any pitching statistic like ERA that does not incorporate DIPS principles is worthless, and that only FIP or tRA or a similar metric is appropriate. The problem is not with the truth that ERA has a lot of biases which have often been overlooked, or that FIP is a better predictor of future performance. The problem comes when a relatively good measure like ERA (one that measures actual runs allowed, which are unquestionably important even if not wholly attributable to the pitcher) is thrown into the dustbin as if it were no more useful or telling than Batting Average or raw RBI count, and anyone who even considers it is classified as a dinosaur.

In defending ERA, I am not saying that it is inherently wrong to conclude that ERA is not the best tool for measuring pitcher performance and that FIP or Metric XYZ is preferable--just that it is wrong to reach that conclusion reflexively, and that the advocates of such a position can sometimes take on a zealous tone. This tone often echoes that of reflexive anti-sabermetric screeds.

You may be thinking to yourself that I am attacking a strawman, and that no one actually thinks like that. I didn't quote/link anyone specifically, because singling out message board posters is not the point of this discussion--but sentiments of this sort are out there.

This is where my admission that this post is half-developed becomes painfully clear, because I really don't have anything to offer about what can or should be done about this (given that I am not a sabermetric evangelist, my developed answer would most likely center on the premise that individuals are responsible for their own rhetoric). At this point, the post will devolve into a paean to do-it-yourself sabermetrics.

There are now at least four large-scale implementations of WAR floating out there--Rally/Baseball-Reference, Fangraphs, BP's WARP, and the Baseball Gauge's WAR. As a result, people will sometimes wonder why sabermetrics can't have a meeting of the minds, hash out all of the differences, and produce one unified version of WAR that can be presented to the wider world.

There are a number of reasons why I think this is a bad idea (the danger of presenting any one metric as *the* uberstat; the legitimate uses of alternate baselines, run estimators, park factors, league adjustments, position adjustments, and all of the other components that go into WAR; the notion that consensus is a positive for its own sake), but that's irrelevant, because it will never happen. The hypothetical moment that it did happen would prove Gary Huckaby right--sabermetrics would be dead. Enforcing standardization for problems in which the answer is often subjective would discourage innovation and discredit alternate views on the question of ability v. value (among others).

Additionally, the existence of one version of WAR, widely accepted and presumably published by major websites, would in my estimation do more to discourage people from figuring their own statistics than anything else in the history of the field--more than Total Baseball or Baseball-Reference or Fangraphs. If you were a new convert to sabermetric thinking, and were told that there was one metric that was the best and was freely available, what would be your most likely reaction?

1) Awesome, this sabermetrics thing is not nearly as complicated as all of the screeds suggested it would be.

2) Darnit, I was hoping to find the unified theory of everything myself.

3) Darnit, I can't believe I don't get to try to figure out how to use a slide rule, or make a spreadsheet, or do SQL coding, or however these saberwhatevers get their results.

There is a lot of value in figuring your own sabermetric statistics, even by just applying other people's metrics. There's no better way to learn about the inputs and how they are combined than by actually walking through the process yourself.
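To make that concrete, here is a minimal sketch of what walking through a metric yourself might look like, using the basic version of Bill James' Runs Created (the sample stat line is hypothetical):

```python
def basic_runs_created(h, bb, tb, ab):
    """Basic Runs Created: (H + BB) * TB / (AB + BB)."""
    return (h + bb) * tb / (ab + bb)

# A hypothetical .300/.373/.500 season: 180 hits, 70 walks, 300 total bases
print(round(basic_runs_created(h=180, bb=70, tb=300, ab=600), 1))  # ~111.9
```

Even ten minutes with a function like this teaches you that basic RC is just an on base ratio times total bases, which immediately tells you something about its philosophy (and its blind spots) that citing the number off a website never will.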

At one time, in order to have up-to-date access to sabermetric stats, you pretty much had no choice but to figure them yourself. This wasted a lot of sabermetricians' man-hours, but it also produced a group of people who were better informed about the construction, and therefore the objectives and philosophy, of the metrics they were using. The easy availability of statistics today is a great boon to the field, to be sure, and it is particularly great for serious practitioners who already have a good understanding of metric construction. I am not a luddite--the downside of a potentially less-informed average consumer of sabermetrics does not outweigh the benefits--but it should serve as an impetus for transparency in computational explanations and for continuing reminders of the why in addition to the numerical results themselves.

Monday, August 20, 2012

Ballpark Thoughts

My apologies to anyone who reads the title of this post and expects a discussion of some arcane aspect of estimating park factors. On Saturday I visited Great American Ball Park in Cincinnati for the first time for the Cubs/Reds doubleheader. I am by no means a well-traveled fan when it comes to attending MLB stadiums--GABP raises my lifetime count to five (after Jacobs Field, Municipal Stadium, PNC Park, and Tropicana Field). What follows are some random thoughts (and snark at the expense of my southern neighbors):

* I expected to be underwhelmed by the ballpark. I can’t point to any particular influence, but I thought that the general consensus on GABP was that it was not on par with the best of the new parks.

Having only been to four of the current parks, I can’t place GABP on a grand continuum, but given my lowered expectations, I was quite impressed. For the day game, I purposefully bought a ticket in the very last row of the stadium behind home plate. My main motivation was shade, but it offered a great opportunity to take in the park from the bird’s eye view. For the second game, I sat in the right field moon deck, which was useful because it provided the opposite orientation.

The Ohio River and the hills of Kentucky provide nice scenery for those facing the outfield. While the park is not situated close enough to the river to allow for splash hits, the river is certainly quite prominent in the view past right field, and the lack of a second or third deck in right field leaves the scenery largely unimpeded.

Those looking towards left field have a much less interesting view--the left field bleachers and the basketball arena block any view of the surroundings. The impediment of the arena validates the decision to leave right field open, as otherwise you wouldn't be able to see much beyond the park from any perspective.

From the outfield perspective, you mostly just have a view of the stadium. Most of the skyline is obscured, with the top of the (surprise) Great American building the highlight. The PNC “power stacks” are a bit of an annoyance from these seats--not because they block the view (they’re behind you) but because you can feel the heat from the napalm or whatever exactly it is that they shoot off. This would be a plus at a cold April game, but is annoying otherwise.

Of course, all this talk about the view misses the primary point of visiting a stadium, which is to watch the ballgame and not the scenery.

* The silliness of the attempt to manufacture a “game day experience” is by no means unique to GABP nor to MLB, but in my limited experience Cincinnati takes the cake. One of the most annoying features added at Jacobs Field over the last few seasons is the way-too-cheery “hosts” who appear on the scoreboard, going around the park and telling you about all the exciting and fun things to do there. GABP has these as well, so I was edified about the pre-game concert featuring a band doing a particularly bland cover of the Black Crowes’ version of “Hard to Handle”, about the Big Red Machine exhibit at the Reds Hall of Fame (the Reds “dominated baseball in the 70s”...I’m sure the A’s really felt dominated), and various other distractions.

The Reds also feature an extraordinary number of mascots. There are four. One is Gapper, the generic fuzzy monster that almost every team has a version of. The next is Mr. Red, a giant baseball head who manages to look much more menacing than Mr. Met. Then there is Mr. Redlegs, the old-time giant baseball head (you can tell by his taste in mustaches) who appears to cheat at the mascot race (conducted only on the scoreboard, which is a plus). Finally, there is Rosie the Red, easily the creepiest mascot in history.

Prior to the day game (the night game was not as bloated), there were three first pitches (a logical conundrum), a kid yelling “play ball”, an honorary captain, and a delivery of the official game ball to the mound. Many traditional religious services have less ceremony.

Again, none of this is unique to Cincinnati, but I’ve previously been fortunate to not encounter so much of it at once.

* For the night game, the first 20,000 fans received a 1995 replica hat with Barry Larkin’s #11 on the side. The good people of Cincinnati really wanted to ensure that they received their hats; the lines to get into the ballpark before the gates opened were very long. I wasn’t quite ready to go in yet (I like getting there early, but ninety minutes is a little much for me), but I bowed to the inevitable and got in line to ensure that I too would receive a 1995 Reds hat.

* Two things struck me about the area surrounding the ballpark. The first was the panhandlers. Now maybe I don’t go to the right (wrong) places, but in the two cities I am most familiar with (Cleveland and Columbus), the panhandlers are not nearly as sophisticated. All of the Cincinnati panhandlers stand there with cardboard signs displaying their sob stories.

The other is scalpers, or more precisely, the lack thereof. Apparently, Cincinnati has fairly strict (and thus asinine) laws against selling tickets above face value. Surely this is still happening, but it is clearly done more discreetly than it is around Jacobs Field or Nationwide Arena.

* Along the river, there are a set of columns which have plaques on each side. These comprise the Steamboat Hall of Fame. Spoiler alert: Many of the steamboats in the Steamboat Hall of Fame met unpleasant demises.

* Charging admission to your team’s Hall of Fame is almost as pretentious as pretending that your team was established in 1869 when it was actually established in 1882.

Monday, August 13, 2012

Standing Still

2012 marked the second season of Greg Beals’ tenure as OSU baseball coach. It was not an encouraging season for the future of the program, and I’ve never been more perplexed by on-field strategy, which is saying something for college baseball. This is not to say that I’m declaring Beals to be incapable of restoring the program to the excellence it enjoyed throughout most of Bob Todd’s tenure, but it looks like it will be a long road from here.

The Buckeyes’ overall record improved from 25-26 to 33-27, but the difference was wrapped up in non-conference play, as OSU’s Big Ten record was essentially unchanged (12-11 to 13-13). OSU played a much less ambitious early season schedule, playing five games against southern teams (Georgia Tech and Coastal Carolina) and the rest against northern opponents. In fairness, OSU’s ISR (Boyd Nation’s ranking system) did rise from #156 to #94.

In conference play, OSU was consistently mediocre. OSU took one of three in series against Purdue, MSU, Nebraska, Illinois, and PSU. The other three series were sweeps--two at home in OSU’s favor, against Minnesota and Northwestern, and one lost to Indiana on the road. The wipeout in Bloomington came in the season’s final weekend and dropped the Bucks into a three-way tie for the sixth and final seed in the Big Ten Tournament. While the Buckeyes won the tiebreaker, it certainly felt as if they had backed into it.

In the Tournament (the last in a four-year arrangement to hold the event at Huntington Park), the Buckeyes rallied to beat Penn State, then lost to #1 seed Purdue. Once in the losers bracket, they beat Nebraska to stay alive but had their season ended by MSU.

OSU’s .550 W% ranked fourth in the Big Ten (Purdue led at .763) and fifth in EW% with a similar .548 (Purdue led at .732). In PW%, OSU looked a little worse (.520, fifth) with Purdue sweeping the W% flavors at .723. The Buckeyes ranked in the middle of the pack in both runs scored (5th at 5.52) and runs allowed (6th at 5.00).
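(A quick sketch for anyone unfamiliar with EW%: it is an expected W% computed from runs scored and allowed per game. Assuming a Pythagenpat construction with the common .28 exponent coefficient, it reproduces the figure quoted above:)

```python
def pythagenpat_wpct(r_per_g, ra_per_g, z=0.28):
    """Expected W% from runs per game; Pythagenpat exponent x = RPG^z."""
    x = (r_per_g + ra_per_g) ** z
    return r_per_g ** x / (r_per_g ** x + ra_per_g ** x)

print(round(pythagenpat_wpct(5.52, 5.00), 3))  # 0.548, OSU's EW% above
```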

OSU’s offense did one thing well, something that is close to my heart--drawing walks. Their .140 W/AB ratio led the conference, well above the average of .100 and far ahead of Illinois’ .107, which ranked second. In fact, Big Ten walk rates were otherwise tightly clustered, with the other ten teams ranging from just .089-.107. OSU ranked in the middle of the pack in batting average (.269 versus a .278 average) and isolated power (.086 versus the .100 average). The Bucks tacked on the conference’s most productive stolen base effort, with 86 steals against 27 caught. This was surprising to me for reasons I’ll expand upon below.

Catcher remained a rough spot for OSU, as junior Greg Solomon posted a 38/6 K/W ratio and .252/.283/.396 line. Freshman Aaron Gretz showed a terrific eye (19 walks in 91 at bats), but little else (.253/.382/.286). First baseman Josh Dezse repeated as one of the team’s most productive hitters (second on the team at 11 RAA), but his power remained an enigma. Dezse tied the school record by belting three homers in a game at Georgia Tech, but hit just two for the remainder of the season. His .120 ISO represented a twenty point drop from his freshman season.

Second baseman Ryan Cypret had a nightmarish campaign a year after being one of the team’s most productive hitters, slumping to .236/.337/.304. Third baseman Brad Hallberg turned in a terrific senior campaign, leading the team with 12 RAA on the strength of a .311/.400/.431 line. Sophomore transfer Kirby Pellant represented an upgrade over OSU’s 2011 shortstop production, but at .274/.358/.340 was below average (-2 RAA).

Freshman Pat Porter took over the left field job as the season progressed, compiling a pretty average .266/.360/.322 (if you are noticing a pattern, this team had a lot of middling averages and high walk rates with minimal power). Sophomore Tim Wetzel was actually the team’s third-most productive hitter by RAA (+6) thanks to his team-leading OBA (.403), but no thanks to his lack of power (.056 ISO for a .336 SLG). David Corna, the primary right fielder, had a rough senior season (.241/.317/.390). Sophomore transfer Mike Carroll (.279/.360/.333 in 186 PA) and Joe Ciamacco (.291/.342/.330 in 111 PA) filled out most of the remaining playing time in the outfield corners and at DH.

Only two other players got significant playing time. Senior Brad Hutton served as part-time DH against left-handed pitchers, managing an average RG thanks to his walks (.220/.350/.340 in 60 PA). Freshman Ryan Leffel served as the utility infielder and could be OSU’s third baseman in 2013. For 2012, though, he could have been called “Josh Dezse’s glove”, as his main role was taking over third base when Dezse moved from first base to the mound (with Hallberg moving from third to first). Leffel appeared in 39 games, but only 3 were starts, and he was limited to 28 PA.

OSU’s fielding (admittedly these metrics leave a lot to be desired) was unremarkable, matching the conference average with a .941 mFA and posting a .674 DER versus the average of .677.

Before the season, OSU’s weekend starters were expected to be junior Brett McKinney, lefty JUCO transfer Brian King, and sophomore Greg Greve. But McKinney and Greve pitched poorly and lost their spots, with sophomore transfer Jaron Long emerging as the staff ace. Long made three relief appearances before establishing himself as the #1, and was the only starter to turn in an above average performance (+17 RAA). Long is a finesse righty who works in the high eighties at best, relying on his control (just 1.2 W/9).

King slotted in as the #2 starter, a bit of a disappointment given the hype with which he arrived. King was solidly average with -1 RAA. The #3 spot remained in flux until midway through the Big Ten season, when sophomore transfer John Kuchno earned the job. Kuchno was not particularly effective (-7 RAA), but his size and arm made him an eighteenth round pick of the Pirates, with whom he signed.

Greve (-3 RAA in 50 innings) and McKinney (-3 RAA in 71 innings) served as midweek starters and will again vie for the rotation in 2013. The bullpen was anchored by Dezse, who was very effective (2.86 RA, +7 RAA in just 28 innings, with seven saves). His strikeout rate (6.0 K/9) continues to lag behind his stuff. Beals had only one lefty with experience in the pen, so senior Andrew Armstrong led the team with 36 appearances spanning just 28 innings. Unfortunately, Armstrong was not nearly as effective as in ’11, his 6.75 RA driven by 26 walks in those 28 innings. Junior sidearmer David Fathalikhani was effective, +5 RAA over 29 innings as Armstrong’s matchup counterpart. Freshman Trace Dempsey is being groomed as Fath’s replacement, but was not effective in 2012 (5.63 RA in 32 innings).

In a second season of observing Greg Beals as coach, I have become absolutely mystified by the man’s strategy. Beals has increased OSU’s reliance on the bunt and basestealing. OSU’s ratio of sacrifices to (singles + walks) was .06 in 2012 and .07 in 2011, compared to .03, .05, and .03 in Todd’s last three seasons. Beals also called for many more steals this season, and it worked out--OSU led the Big Ten in steals with a solid success rate (75%).

However, it was Beals’ fascination with one particular stolen base play that really got my blood boiling. Beals is obsessed with the delayed steal of home with two outs and runners at the corners. Beals surely dreams about this play every night. I wish I had an easy way of counting how many times it was attempted, but my rough guess is once per series. It rarely worked; it might be an insult to high school teams to call it a high school-level play. It was especially absurd to keep trotting it out in Big Ten play, as if the other coaches in the conference were a bunch of rubes with no ability to scout and no institutional memory.

OSU will be a popular pick to compete for the Big Ten title in 2013. The only key players who were lost to graduation/draft are third baseman Hallberg, right fielder Corna, starter Kuchno, and reliever Armstrong. The incoming freshman class was not ravaged by draft signings as Beals’ 2012 group was, and figures to infuse some pitching options. But my observation (anecdotal only) is that some of the most overrated teams in college sports are mediocre teams that return a lot of starters. The Buckeyes lack power, they lack quality starting pitching outside of Long, and to date they lack a coach who has proven that he can assemble a Big Ten contender.

Monday, August 06, 2012

All Models Are Wrong

The statistician George Box once wrote that “Essentially, all models are wrong, but some are useful.” Whether the context in which this was written is identical or even particularly close to the sabermetric issues I’m going to touch on isn’t really the point. Perfect models only occur when severe constraints can be imposed.

Since you can’t have a perfect model, the designer and user must decide what level of accuracy is acceptable for the purposes for which the model will be used. This is a question on which the designers and evaluators of sabermetric methods and the end users often disagree. In my stuff (I’ll use “stuff” because “research” is too pretentious and “work” is too sterile), my checklist of ideal properties would go something like this:

1. Does the method work under normal conditions? (i.e. does the run estimator make accurate predictions for teams at normal major league levels of offense)--This is the first hurdle that any sabermetric method must clear. If you can’t even do this, then your model truly is useless.

2. Does the method work for unusual conditions?--I'm going to draw a distinction here between “unusual” and “extreme”. Unusual conditions are those that represent the tails of the usual distributions we observe in baseball. A 70-92 team is not unusual, but a 54-108 team is (and keep in mind that individuals will exhibit a wider range of performance than teams). If the method fails under unusual conditions, it may still be useful, but extreme caution has to be taken when using it. Teams and players that threaten to break it will occur often enough that one will quickly tire of repeating the caveats.

3. Does the method work for extreme conditions?--Extreme conditions are the most unusual of cases, ones that will never occur at high levels of baseball over large samples. A player that hits a home run in every at bat will obviously never exist (although a player could have a game in which he hits a home run in every at bat). A method that can’t handle these extremes can still be quite useful. However, a method that can handle the extremes is much more likely to be a faithful representation of the underlying process. Furthermore, methods that are not accurate at extremes must begin to break down somewhere along the way, so if a model truly works better at the extremes, there’s a good chance it will also produce marginally better results than an alternative model for some of the unusual cases which will actually be observed.

And simply as a matter of advancing our knowledge of baseball, a model that works at the extremes helps us understand the true relationship between the inputs and outputs. Take W% estimators as an example. We know that, as a rule of thumb, 10 runs = 1 win. This rule of thumb holds very well for teams in the usual range of run differentials in run-of-the-mill major league scoring conditions. But why does it work? Anyone with a spreadsheet can run a regression and demonstrate the rule of thumb, but a model that works at the extremes can demonstrate why it is true and how the relationship differs as we move away from those typical conditions (see the sketch following this list).

4. How simple is the method?--I differ from many other people in the relative weight given to the question of simplicity. There are some people who would rank it #1 or #2 on their own list, and would be willing to accept a much less accurate estimator if it was also much simpler.

Obviously, if two approaches are equally accurate (or very close to being equally accurate), it makes sense to use the simpler one. But I’m not a fan of sacrificing any accuracy if I don’t have to. Limits to this principle are much more relevant in fields in which more advanced models are used. However, most sabermetric models (particularly for the type of questions that I’ve always focused on) really are not complex at all and do not tax computer systems or trigger any other practical constraints on complexity. People might say that Base Runs is more complex than Runs Created, but the differences between common sabermetric methods are on the level of adding one more step or one more operation.
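Here is the promised sketch for the runs-per-win example from item #3. It assumes a Pythagenpat win estimator (exponent coefficient .28, a common choice); the point is that the model, unlike the regression, shows how the rule of thumb moves with the scoring level:

```python
def pythagenpat_wpct(r_per_g, ra_per_g, z=0.28):
    """Estimated W% from runs scored/allowed per game (Pythagenpat)."""
    x = (r_per_g + ra_per_g) ** z
    return r_per_g ** x / (r_per_g ** x + ra_per_g ** x)

def runs_per_win(r_per_g, ra_per_g, dr=0.001):
    """Marginal runs needed per marginal win, via a numerical derivative.
    Season length cancels out of the ratio."""
    dwpct = (pythagenpat_wpct(r_per_g + dr, ra_per_g)
             - pythagenpat_wpct(r_per_g, ra_per_g))
    return dr / dwpct

print(round(runs_per_win(4.5, 4.5), 1))  # ~9.7: the familiar 10 runs = 1 win
print(round(runs_per_win(7.0, 7.0), 1))  # ~13.4: wins cost more runs when scoring is high
```

Anyone can get the 9.7 from a regression; only a model that holds up away from the center tells you why the figure drifts toward 13 as scoring climbs.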

Now that I’ve attempted to define where I am coming from on this general question, let me turn to the specific trigger for this post: Aroldis Chapman’s FIP. C. Trent Rosecrans of CBS pointed out that Chapman’s FIP for July was negative. It goes without saying that this is a breakdown in the model, as a negative number of runs allowed makes no sense.

I have always been prone to overreacting to a few comments on a site, running off to compose a blog post in response not to a strawman, but to an extreme minority opinion. That’s probably the case here.

Still, it is interesting which metrics and which results set off this sort of reaction, and which generally don’t. A number of us tried unsuccessfully for years to argue for the replacement of Runs Created with Base Runs, but a number of very intelligent people resisted the idea. Arguments were advanced that Base Runs was too complex or that given the errors inherent to the exercise, Runs Created was good enough.
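The extreme test from my checklist above makes the case for Base Runs in just a few lines. This is a minimal sketch using the basic versions of each formula; the Base Runs B-factor coefficients are one commonly published set, not the only one:

```python
def runs_created(ab, h, tb, bb):
    """Basic Runs Created: (H + BB) * TB / (AB + BB)."""
    return (h + bb) * tb / (ab + bb)

def base_runs(ab, h, tb, bb, hr):
    """Base Runs: A*B/(B+C) + D, with one common set of B coefficients."""
    a = h + bb - hr                                       # baserunners
    b = (1.4 * tb - 0.6 * h - 3 * hr + 0.1 * bb) * 1.02   # advancement
    c = ab - h                                            # outs
    d = hr                                                # automatic runs
    return a * b / (b + c) + d

# A team that homers in every at bat must score exactly one run per AB:
print(runs_created(ab=600, h=600, tb=2400, bb=0))       # 2400.0 -- wildly wrong
print(base_runs(ab=600, h=600, tb=2400, bb=0, hr=600))  # 600.0 -- exactly right
```

No real team will ever look like this, but only one of these models is telling the truth about how runs actually score.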

OPS+ also remains in use as a go-to quick-and-dirty stat, due largely to its prominence on Baseball-Reference. Yet OPS+ clearly undervalues OBA and can also, in extreme situations, return a negative number of implied runs. ERA+ distorts the true difference in runs allowed rate and makes errors in aggregation across pitchers and seasons seem natural, but B-R had to backtrack quickly when it replaced ERA+ with something better because of the backlash.

It should be noted that some of the reaction to any flaws in FIP are likely related to general disagreement and distrust of DIPS theory as a whole (Colin Wyers pointed this possibility out to me). DIPS has always engendered a somewhat visceral reaction and remains the most controversial piece of the standard sabermetric toolkit. An obviously flawed result from the most ubiquitous member of the DIPS family is the perfect opportunity to lash out.

It’s no mystery why FIP fails for Chapman’s July. FIP is based on linear weights, and any linear weights estimator of absolute runs will return a negative estimate at low enough levels of offense. For example, a game in which there are three singles, one double, one walk, and 27 outs will produce a run estimate of around -.1 runs [3*.5 + 1*.8 + 1*.3 - 27*.1]. Run scoring is not a linear process, but there are many advantages to using a linear approximation (I won’t recap those here, as this should be familiar territory). However, the weights are tailored for a normal environment, and they become less reliable as the environment becomes more extreme.
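In code, with the rough weights from the bracketed example (these are illustrative values, not a precise empirical set):

```python
# Rough absolute linear weights (runs per event) from the example above
WEIGHTS = {"1B": 0.5, "2B": 0.8, "BB": 0.3, "out": -0.1}

def linear_runs(events):
    """Estimate absolute runs as a weighted sum of event counts."""
    return sum(WEIGHTS[event] * count for event, count in events.items())

# Three singles, one double, one walk, 27 outs -- a plausible real game:
print(round(linear_runs({"1B": 3, "2B": 1, "BB": 1, "out": 27}), 1))  # -0.1
```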

In July, Chapman’s line looked like this:

IP H HR W K
14.1 6 0 2 31

No linear run estimator calibrated to a normal context will survive something this extreme. Fortunately, it is an extreme context observed over a very small sample. A month may superficially seem like a significant split, but fourteen innings is roughly equivalent to two starts.
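To see the breakdown directly: the standard FIP formula is (13*HR + 3*BB - 2*K)/IP plus a constant that puts the result on the league’s ERA scale. The 3.10 constant below is an assumption (it varies by league-season, and some versions fold HBP in with the walks), but no plausible constant rescues this line:

```python
def fip(ip, hr, bb, k, constant=3.10):
    """FIP: linear weights on defense-independent events plus an assumed
    league constant to put the result on an ERA scale."""
    return (13 * hr + 3 * bb - 2 * k) / ip + constant

# Chapman's July line; 14.1 IP in box score notation = 14 1/3 innings
print(round(fip(ip=14 + 1/3, hr=0, bb=2, k=31), 2))  # about -0.81
```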

Going back to my checklist for a metric above, I do place a high weight on accuracy at the extremes. While I am personally comfortable with the use of FIP for quick-and-dirty purposes, I happen to agree with some of the critics that it isn’t particularly appropriate for use in a WAR calculation (presuming that you want to include a defense-independent metric as the input for WAR at all). It doesn’t make a lot of sense to sweat the small stuff, as WAR implementations do, while accepting imprecision in the main driver (FIP). While the issue of negative runs will not be present over the long haul, using a linear run estimator for individual pitchers is needlessly imprecise by my criteria.

There are at least a couple of different Base Runs-based DIPS formulas out there--Voros McCracken himself has one, I have one (see “dRA” here), and there could be others that have slipped my mind. Using dRA, Chapman’s July checks in at .68, which seems pretty reasonable.

The moral of the story is that our methods will always be flawed in one manner or another. Sometimes the designer or user has a tradeoff to make between simplicity and theoretical accuracy. Depending on the question being asked, there may be good reason to shift one’s priorities in choosing a metric. Ideally, though, one should be as consistent as possible in making those choices. At the risk of painting with a broad brush, it is that consistency which appears to be lacking in some of the reaction in this case.