Wednesday, July 15, 2020

Hallmarks of Quality Metrics

Another old post I never published, probably because it was repetitive of sentiments I'd written before. I'm guessing I must have encountered a metric that really annoyed me and this was written as a responsive missive.

In a previous article I discussed some of the shortcomings of OPS as an advanced metric, which naturally leads to the question: “What are the characteristics of good advanced metrics?” I’ve used the term “metric” loosely here to refer to any statistic or derived category, which is not precise terminology. While the relative importance that one places on each criterion is up for debate (the list that follows is in no particular order), the following considerations should be relatively non-controversial:

1. Clear purpose

Before one designs a metric or uses it to answer a question, it’s imperative that the question of interest be defined. What is the metric setting out to measure? Most metrics in use, even those that are not in favor with sabermetricians, do fairly well on this score. Counting statistics, regardless of their ultimate utility, are largely clear in terms of definition and meaning. Some, like hits or strikeouts, are inherently obvious. Those with more involved definitions often still have a clear purpose even if the execution of that idea is somewhat muddled (like errors).

2. Developed with a theory in mind

This criterion is closely related to a clear purpose, but takes it a step further by questioning the thought process that went into developing the metric. OPS doesn’t fail this test, as it is based on the reasonable notion that hitting can be broken down into the broad categories of getting runners on base (OBA) and advancing them (SLG). However, due to the somewhat arbitrary manner in which the two statistics are combined, OPS does not match up to metrics like wOBA and True Average, which have as their basis a linear weight model of the run scoring process. Some proposed metrics fail spectacularly, though, as they simply combine statistical columns without any particular rhyme or reason. Thankfully, most of these never gain traction, although a few that never caught on still have their own Wikipedia pages. Metrics of this type may appear to “work” as they will generally produce reasonable leader boards, but the same could be said for any haphazard combination of positive events and categories.

3. Accurate

A metric should result in an accurate estimate of whatever it is designed to measure. For instance, a metric that attempts to measure offensive productivity should have a strong correlation with team runs scored, as scoring runs is the prime objective for an offense. The best-performing models for estimating team runs scored tend to be based on either dynamic models of the run scoring process (such as David Smyth’s Base Runs) or linear weight models (pioneered by George Lindsey and Pete Palmer and now in wide use). Thus it stands to reason that metrics built on linear weights (such as wOBA) are a better tool to use when evaluating offensive production than alternatives that do not correlate as well with runs scored.
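As a rough sketch of how such a check might be run, the snippet below compares a set of run estimates against actual team totals using both the correlation coefficient and the root mean squared error (RMSE). The function names and the team totals are invented for illustration; they are not drawn from any real season or any particular estimator.

```python
import math

def rmse(estimates, actuals):
    """Root mean squared error: the typical size of a miss, in runs."""
    if len(estimates) != len(actuals) or not actuals:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(e - a) ** 2 for e, a in zip(estimates, actuals)]
    return math.sqrt(sum(squared) / len(squared))

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical team totals, invented purely for illustration.
actual_runs = [650, 700, 750, 800]
estimated_runs = [700, 750, 800, 850]  # an estimator that always runs 50 high

print(pearson_r(estimated_runs, actual_runs))
print(rmse(estimated_runs, actual_runs))
```

The made-up estimator here tracks the actual totals perfectly in terms of correlation while missing every team by the same amount, which is one reason to look at the size of the errors in runs and not only at the correlation.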

Sometimes, though, it is not easy to measure accuracy due to a lack of data to verify against or a desire to use the metric to address a similar but subtly distinct question. For example, metrics validated against team results are often used to measure individual performance, which leads to the next criterion.

4. Adaptable over a wide range of contexts

While there is nothing inherently wrong with a metric that is designed to work only under a limited set of conditions--so long as said metric is not stretched beyond its capabilities--it is better still to be confident that the metric will produce reasonable results for a broader set of questions.

Sometimes metrics work well over normal ranges of performance and thus provide reasonable answers for most questions. For example, the common rule of thumb that 10 runs = 1 win is quite accurate at predicting the win totals of major league teams from their runs scored and allowed. However, the actual relationship between runs and wins is not linear—it only appears to be linear because the conversion is calibrated over a narrow set of possible outcomes. When the model is applied to more extreme conditions (which in this case could be an average level of runs scored per game much different than major league norms or teams with very low or very high run differentials), the accuracy will suffer. A dynamic model of estimated winning percentage (such as Pythagenport) can maintain accuracy over a wider range of scenarios.
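The contrast can be sketched in a few lines. The linear function below encodes the 10-runs-per-win rule of thumb, and the second uses the commonly cited form of the Pythagenport exponent, 1.5 * log10(runs per game) + 0.45, where runs per game counts both teams. The function names and the team totals are hypothetical, chosen only to show one ordinary case and one extreme case.

```python
import math

def linear_win_pct(runs, runs_allowed, games, runs_per_win=10.0):
    """Rule of thumb: every `runs_per_win` runs of differential is one win."""
    return 0.5 + (runs - runs_allowed) / (runs_per_win * games)

def pythagenport_win_pct(runs, runs_allowed, games):
    """Pythagorean form whose exponent floats with the run environment."""
    runs_per_game = (runs + runs_allowed) / games  # both teams combined
    exponent = 1.5 * math.log10(runs_per_game) + 0.45
    scored = runs ** exponent
    allowed = runs_allowed ** exponent
    return scored / (scored + allowed)

# Hypothetical 162-game teams: one ordinary, one absurdly dominant.
for runs, runs_allowed in [(750, 700), (2000, 300)]:
    print(runs, runs_allowed,
          round(linear_win_pct(runs, runs_allowed, 162), 3),
          round(pythagenport_win_pct(runs, runs_allowed, 162), 3))
```

For the ordinary team the two estimates land within a few thousandths of each other. For the absurd one, the linear rule returns a winning percentage above 1.000, while the Pythagorean form is bounded between 0 and 1 by construction.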

A related but slightly different issue occurs when some metrics that are designed for use with team data are applied to individuals. A classic example is Bill James’ original version(s) of Runs Created, which recognizes the dynamic relationship between getting runners on base and advancing them. When applied to an individual’s statistics, though, the implication is that the player is reaching base, then advancing himself around the bases, whereas he actually interacts with his teammates. The resulting distortion requires that caution be used when interpreting Runs Created estimates for individual players.
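A small sketch can make the distortion concrete. It uses the basic Runs Created formula, (H + BB) * TB / (AB + BB), and compares two ways of crediting a star hitter: applying the formula to his own line, and taking the difference between the team's Runs Created with and without him. The batting lines are invented, and the with-and-without comparison is just one illustrative way to see the effect, not James’ own procedure.

```python
from dataclasses import dataclass

@dataclass
class BattingLine:
    ab: int  # at bats
    h: int   # hits
    bb: int  # walks
    tb: int  # total bases

def combine(lines):
    """Add up several batting lines into a single aggregate line."""
    return BattingLine(
        ab=sum(l.ab for l in lines),
        h=sum(l.h for l in lines),
        bb=sum(l.bb for l in lines),
        tb=sum(l.tb for l in lines),
    )

def runs_created(line):
    """Basic Runs Created: (on base) * (advancement) / (opportunities)."""
    return (line.h + line.bb) * line.tb / (line.ab + line.bb)

# Hypothetical lines, invented for illustration only.
star = BattingLine(ab=500, h=180, bb=100, tb=360)
teammates = [BattingLine(ab=550, h=143, bb=50, tb=225) for _ in range(8)]

standalone = runs_created(star)
marginal = runs_created(combine(teammates + [star])) - runs_created(combine(teammates))

print(round(standalone, 1), round(marginal, 1))
```

With these made-up lines the standalone figure comes out noticeably higher than the with-and-without figure, because the formula multiplies the star's own on-base total by his own total bases as if he were driving himself in.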

5. Expressed in meaningful units

Ideally, the metric should return a result that has a logical, interpretable baseball meaning. Metrics expressed in terms of runs and wins are ideal since the connection to the objective of the game is made clear, but there are any number of other expressions that can be meaningful. On Base Average, for instance, represents the percentage of plate appearances in which a batter reaches safely, which is easy to explain and easy to think about in terms of on-field implications.

In some rare instances, it is next to impossible to express a result in meaningful units and so a nebulous value must suffice. One example is Bill James’ Speed Score, which endeavors to estimate a player’s speed skill by taking into account a number of categories related to speed (such as stolen base attempt frequency, rate of triples per ball in play, defensive range, etc.). Since there is no single manifestation of speed on the field and no obvious units to capture baseball speed, James uses an abstract scale.

6. Not needlessly complex

It is certainly tempting to say that metrics should be simple, but in my opinion simplicity need not be a goal unto itself. What is important is that the metric not make things more complicated than they need to be.

However, describing complex processes sometimes necessitates the use of complex models. The key is to avoid phony precision and complexity for its own sake. The end use and user of the metric should also be considered: if a “quick and dirty” estimate will suffice, then a simple metric will do, but a more complex metric can be used when a true best estimate is needed.

7. Catchy name

This final entry is somewhat tongue-in-cheek, as it is irrelevant to the quality of a metric, but there’s no denying that when it comes to mainstream acceptance, marketing matters. To bring things full circle, a good name succinctly references the intended purpose and use of the metric while providing a minimum amount of ammunition to those looking to mock the field. Whether any sabermetric measures score particularly well on this front will be left as a rhetorical question for the reader.