Monday, August 06, 2012

All Models Are Wrong

The statistician George Box once wrote that “Essentially, all models are wrong, but some are useful.” Whether the context in which this was written is identical or even particularly close to the sabermetric issues I’m going to touch on isn’t really the point. Perfect models only occur when severe constraints can be imposed.

Since you can’t have a perfect model, the designer and user must decide what level of accuracy is acceptable for the purposes for which the model will be used. This is a question on which the designers and evaluators of sabermetric methods and the end users often disagree. In my stuff (I’ll use “stuff” because “research” is too pretentious and “work” is too sterile), my checklist of ideal properties would go something like this:

1. Does the method work under normal conditions? (e.g., does the run estimator make accurate predictions for teams at normal major league levels of offense?)--This is the first hurdle that any sabermetric method must clear. If you can’t even do this, then your model truly is useless.

2. Does the method work for unusual conditions?--I'm going to draw a distinction here between “unusual” and “extreme”. Unusual conditions are those that represent the tails of the usual distributions we observe in baseball. A 70-92 team is not unusual, but a 54-108 team is (and keep in mind that individuals will exhibit a wider range of performance than teams). If the method fails under unusual conditions, it may still be useful, but extreme caution has to be taken when using it. Teams and players that threaten to break it will occur often enough that one will quickly tire of repeating the caveats.

3. Does the method work for extreme conditions?--Extreme conditions are the most unusual of cases, ones that will never occur at high levels of baseball over large samples. A player that hits a home run in every at bat will obviously never exist (although a player could have a game in which he hits a home run in every at bat). A method that can’t handle these extremes can still be quite useful. However, a method that can handle the extremes is much more likely to be a faithful representation of the underlying process. Furthermore, methods that are not accurate at extremes must begin to break down somewhere along the way, so if a model truly works better at the extremes, there’s a good chance it will also produce marginally better results than an alternative model for some of the unusual cases which will actually be observed.

And simply as a matter of advancing our knowledge of baseball, a model that works at the extremes helps us understand the true relationship between the inputs and outputs. Take W% estimators as an example. We know that, as a rule of thumb, 10 runs = 1 win. This rule of thumb holds very well for teams in the usual range of run differentials in run-of-the-mill major league scoring conditions. But why does it work? Anyone with a spreadsheet can run a regression and demonstrate the rule of thumb, but a model that works at the extremes can demonstrate why it is true and how the relationship differs as we move away from those typical conditions.
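To make the rule of thumb concrete, here is a rough sketch of how a W% model can show where "10 runs = 1 win" comes from and how it drifts with the scoring environment. It assumes a Pythagorean-style estimator whose exponent floats with runs per game; the specific form and the 0.287 constant are my assumptions for illustration, not something stated above.

```python
def pythagenpat_wpct(runs, runs_allowed, games):
    """Estimated W% with an exponent that depends on the scoring environment.

    The exponent rule (total runs per game ** 0.287) is an assumed value
    chosen for illustration.
    """
    runs_per_game = (runs + runs_allowed) / games
    exponent = runs_per_game ** 0.287
    return runs ** exponent / (runs ** exponent + runs_allowed ** exponent)


def runs_per_win(team_runs_per_game, games=162, extra_runs=10):
    """Runs needed to buy one extra win for an otherwise .500 team."""
    base = team_runs_per_game * games
    extra_wins = (pythagenpat_wpct(base + extra_runs, base, games) - 0.5) * games
    return extra_runs / extra_wins


# Low, typical, and high scoring environments (runs per team per game).
for rpg in (3.0, 4.5, 6.0):
    print(rpg, round(runs_per_win(rpg), 2))
```

Under these assumptions the cost of a win comes out close to ten runs at 4.5 runs per team per game, lower when scoring is scarce and higher when scoring is plentiful, which is the kind of "why" a regression alone does not give you.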

4. How simple is the method?--I differ from many other people in the relative weight given to the question of simplicity. There are some people who would rank it #1 or #2 on their own list, and would be willing to accept a much less accurate estimator if it was also much simpler.

Obviously, if two approaches are equally accurate (or very close to being equally accurate), it makes sense to use the simpler one. But I’m not a fan of sacrificing any accuracy if I don’t have to. Limits to this principle are much more relevant in fields in which more advanced models are used. However, most sabermetric models (particularly for the type of questions that I’ve always focused on) really are not complex at all and do not tax computer systems or trigger any other practical constraints on complexity. People might say that Base Runs is more complex than Runs Created, but the differences between common sabermetric methods are on the level of adding one more step or one more operation.
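As a rough illustration of how small the gap in complexity is, here is a sketch of basic Runs Created next to a simple Base Runs. The B-factor weights in the Base Runs function are illustrative assumptions (published versions differ in their coefficients), and both functions assume every out is an at-bat out.

```python
def runs_created(singles, doubles, triples, homers, walks, outs):
    """Basic Runs Created: (H + W) * TB / (AB + W)."""
    hits = singles + doubles + triples + homers
    total_bases = singles + 2 * doubles + 3 * triples + 4 * homers
    at_bats = hits + outs  # assumption: no sacrifices or other non-AB outs
    return (hits + walks) * total_bases / (at_bats + walks)


def base_runs(singles, doubles, triples, homers, walks, outs):
    """Simple Base Runs: A * B / (B + C) + D.

    A = baserunners, B = advancement, C = outs, D = home runs.
    The B weights are assumed values for illustration only.
    """
    a = singles + doubles + triples + walks
    b = 0.8 * singles + 2.1 * doubles + 3.4 * triples + 1.8 * homers + 0.1 * walks
    c = outs
    d = homers
    return a * b / (b + c) + d


# A hypothetical low-offense game: 3 singles, 1 double, 1 walk, 27 outs.
quiet_game = dict(singles=3, doubles=1, triples=0, homers=0, walks=1, outs=27)
print(runs_created(**quiet_game), base_runs(**quiet_game))

# The extreme case: ten plate appearances, ten home runs, no outs.
all_homers = dict(singles=0, doubles=0, triples=0, homers=10, walks=0, outs=0)
print(runs_created(**all_homers), base_runs(**all_homers))
```

Base Runs costs one extra term, and in the all-home-run case that term is what lets it return ten runs while basic Runs Created returns forty.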

Now that I’ve attempted to define where I am coming from on this general question, I can turn to the specific trigger for this post: Aroldis Chapman’s FIP. Trent Rosecrans of CBS pointed out that Chapman’s FIP for July was negative. It goes without saying that this is a breakdown in the model, as a negative number of runs makes no sense.

I have always been prone to overreact to a few comments on a site, and run off and compose a blog post to respond not to a strawman, but to an extreme minority opinion. That’s probably the case here.

Still, it is interesting which metrics and which results set off this sort of reaction, and which generally don’t. A number of us tried unsuccessfully for years to argue for the replacement of Runs Created with Base Runs, but a number of very intelligent people resisted the idea. Arguments were advanced that Base Runs was too complex or that given the errors inherent to the exercise, Runs Created was good enough.

OPS+ also remains in use as a go-to quick-and-dirty stat, due largely to its prominence on Baseball-Reference. Yet OPS+ clearly undervalues OBA and can also, in extreme situations, return a negative number of implied runs. ERA+ distorts the true difference in runs allowed rate and makes errors in aggregation across pitchers and seasons seem natural, but B-R had to backtrack quickly because of the backlash when they replaced it with something better.
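Both complaints are easy to sketch with the commonly cited formulas, ignoring park adjustments. The league averages and player lines below are made-up inputs for illustration.

```python
def ops_plus(obp, slg, lg_obp, lg_slg):
    """OPS+ without park adjustment: 100 * (OBP/lgOBP + SLG/lgSLG - 1)."""
    return 100 * (obp / lg_obp + slg / lg_slg - 1)


def era_plus(era, lg_era):
    """ERA+ without park adjustment: 100 * lgERA / ERA."""
    return 100 * lg_era / era


# A hypothetical hitter at .100 OBP / .100 SLG in a .330 / .420 league.
weak_hitter = ops_plus(0.100, 0.100, 0.330, 0.420)
print(weak_hitter)  # below zero

# Two hypothetical pitchers with equal innings in a 4.00 ERA league.
average_of_era_plus = (era_plus(2.00, 4.00) + era_plus(4.00, 4.00)) / 2
combined_era_plus = era_plus(3.00, 4.00)  # combined ERA is 3.00
print(average_of_era_plus, combined_era_plus)
```

The hitter comes out negative, and averaging the two ERA+ figures gives 150 while the ERA+ of the combined line is about 133, which is the aggregation trap.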

It should be noted that some of the reaction to any flaws in FIP is likely related to general disagreement with and distrust of DIPS theory as a whole (Colin Wyers pointed this possibility out to me). DIPS has always engendered a somewhat visceral reaction and remains the most controversial piece of the standard sabermetric toolkit. An obviously flawed result from the most ubiquitous member of the DIPS family is the perfect opportunity to lash out.

It’s no mystery why FIP fails for Chapman’s July. FIP is based on linear weights, and any linear weight estimator of absolute runs will return a negative estimate at low levels of offense. For example, a game in which there are three singles, one double, one walk, and 27 outs will result in a run estimate of around -.1 runs. [3*.5 + 1*.8 + 1*.3 - 27*.1] Run scoring is not a linear process, but there are many advantages to using a linear approximation (I won’t recap those here but this should be familiar territory). However, the weights are tailored for a normal environment, and they become less reliable as the environment becomes more extreme.
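The bracketed arithmetic can be written out as a tiny linear estimator. The weights are the rounded ones from the example above; the dictionary layout and function name are just for illustration.

```python
# Rounded linear weights from the example above (runs per event).
WEIGHTS = {"single": 0.5, "double": 0.8, "walk": 0.3, "out": -0.1}


def linear_runs(events):
    """Sum of (weight * count) over every event type in the game."""
    return sum(WEIGHTS[event] * count for event, count in events.items())


game = {"single": 3, "double": 1, "walk": 1, "out": 27}
estimate = linear_runs(game)
print(round(estimate, 2))
```

Every out subtracts a fixed tenth of a run no matter how few runners are on base to strand, so with enough outs and few enough baserunners the total has to drop below zero.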

In July, Chapman’s line looked like this:

14.1 6 0 2 31

No linear run estimator that is accurate in a normal context will survive something this extreme. Fortunately, it is an extreme context observed over a very small sample size. A month may superficially seem like a significant split, but fourteen innings is roughly equivalent to two starts.
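Here is a rough sketch of how that line produces a negative FIP. It assumes the common form of the formula, (13*HR + 3*BB - 2*K) / IP plus a league constant, with hit batters left out because the line above does not list them. The 3.10 constant is a placeholder of mine, not the actual 2012 value, and I am reading the line as 14.1 innings, 0 home runs, 2 walks, and 31 strikeouts.

```python
def innings(ip_notation):
    """Convert box-score innings such as '14.1' (14 and 1/3) to a float."""
    whole, _, thirds = ip_notation.partition(".")
    return int(whole) + int(thirds or 0) / 3


def fip(home_runs, walks, strikeouts, ip, constant=3.10):
    """Common FIP form without HBP; the constant is an assumed placeholder."""
    return (13 * home_runs + 3 * walks - 2 * strikeouts) / ip + constant


july_ip = innings("14.1")
july_fip = fip(home_runs=0, walks=2, strikeouts=31, ip=july_ip)
print(round(july_fip, 2))
```

With 31 strikeouts against 2 walks in about 14 innings, the strikeout term alone outweighs any plausible constant, so the result is negative whatever the exact league figure was.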

Going back to my checklist for a metric above, I do place a high weight on accuracy for extremes. While I am personally comfortable with the use of FIP for quick and dirty situations, I happen to agree with some of the critics that it isn’t particularly appropriate for use in a WAR calculation (presuming that you want to include a defense-independent metric as the input for WAR at all). It doesn’t make a lot of sense to sweat the small stuff, as WAR does, everywhere except the main driver (FIP). While the issue of negative runs will not be present over the long haul, using a linear run estimator for individual pitchers is needlessly imprecise by my criteria.

There are at least a couple of different Base Runs-based DIPS formulas out there: Voros McCracken himself has one, I have one (see “dRA” here), and there could be others that have slipped my mind. Using dRA, Chapman’s July checks in at .68, which seems pretty reasonable.

The moral of the story is that our methods will always be flawed in one manner or another. Sometimes the designer or user has a tradeoff to make between simplicity and theoretical accuracy. Depending on the question the metric is being used to answer, there may be a reason to change one’s priorities in choosing a metric. Ideally, one should be as consistent as possible in making those choices. At the risk of painting with a broad brush, it is that consistency that appears to be lacking in some of the reaction in this case.

1 comment:

  1. Nice post, Patriot. I'm definitely someone who values simplicity. I'd rank it third, perhaps second, depending on where the line between unusual and extreme is drawn. That's because I'm not smart enough to understand complicated models, and I don't trust something I don't understand, even if it appears to work. I think this is a distinction that separates a lot of sabermetric discussions.

