Walk Like a Sabermetrician: Crude Team Ratings, 2012

For the last few years I have published a set of team ratings that I call “Crude Team Ratings”. The name was chosen to reflect the nature of the ratings--they have a number of limitations, of which I documented several when I introduced the methodology. Crude or not, this was a banner year in sabermetrics in which to be a purveyor of a team rating system, and I wouldn’t want to miss out on the fun.

The silliness of the Fangraphs power rankings and the eventual decision to modify them (while shifting blame to defensive shifts for odd results rather than the logic of the method) offered an opportunity to consider the nature of such systems. When you think about it, team ratings may actually be the most controversial and important objective methods used in sports analysis. As sabermetricians it is easy to overlook this because they don’t play a large role in baseball analysis. But for college sports, rating systems are not just a way to draw up lists of teams--they help determine which teams are invited to compete for the national championship. And while most teams with a chance to win the championship in sports with large tournaments are comfortably in the field by any measure, in college football ranking systems are asked to make distinctions between multiple teams that would be capable of winning the title if permitted to compete for it.

There are any number of possible ways to define a team rating system, but to simplify things I will propose two broad questions which should be asked before such a system is devised:

1. Do you wish to rank teams based on their bottom line results (wins and losses), or include other distinguishing factors (underlying performance, generally in terms of runs/runs allowed or predictors thereof)?

I would contend that if you are using team ratings to filter championship contenders, it is inappropriate to consider the nature of wins and losses; only the binary outcomes should count. If you are attempting to predict how teams will perform in the future, then you’d be a fool not to consider other factors.

2. Do you wish to incorporate information about the strength of the team at any given moment in time, or do you wish to evaluate the team on its entire body of work?

I would contend that for use as a championship filter, the entire body of work should be considered, with no adjustments made for injuries, trades, performance by calendar, etc. If you are using ratings to place bets, then ignoring these factors means that you must consider sports books to be the worthiest of charities.

Obviously my two questions and accompanying answers are painted in broad strokes. But defining what you are setting out to measure in excessively broad strokes is always preferable to charging ahead with no underlying logic and no attempt to justify (or even define) that logic. Regardless of how big your website is, how advanced your metrics are, how widely used your underlying metric is for other purposes, how much self-assuredness you make your pronouncements with, or who is writing the blurbs for each team, if you don’t address basic questions of this nature, your final product is going to be an embarrassing mess. Fangraphs learned that the hard way.

For the two basic questions above, CTR offers some flexibility on the first question. It can only use team win ratio as an input, but that win ratio can be estimated. In this post I’ll present four variations--based on actual wins and losses, based on runs/runs allowed, based on game distribution-adjusted runs/runs allowed, and based on runs created/runs created allowed. You could think up other inputs or any number of permutations thereof (such as actual wins/losses regressed 25% or a weighted average of actual and Pythagorean record, etc.). On the second question, CTR has no real choice but to use the team’s entire body of work.

I explained how CTR is figured in the post linked at the top of this article, but in short:

1) Start with a win ratio figure for each team. It could be actual win ratio, or an estimated win ratio.

2) Figure the average win ratio of the team’s opponents.

3) Adjust for strength of schedule, resulting in a new set of ratings.

4) Begin the process again. Repeat until the ratings stabilize.
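The four steps above can be sketched in a few lines of Python. This is only one plausible reading of the summary, not the exact procedure from the original methodology post: the function name `crude_team_ratings`, the `schedule` structure, the games-weighted average used for strength of schedule, and the rescaling so that the average rating is 100 are all assumptions made for illustration, and the three teams are hypothetical.

```python
def crude_team_ratings(win_ratio, schedule, max_iter=1000, tol=1e-10):
    """Iteratively adjust win ratios for strength of schedule (a sketch).

    win_ratio: dict of team -> win ratio (wins / losses, actual or estimated)
    schedule:  dict of team -> dict of opponent -> games played
    Returns a dict of team -> rating, scaled so the average rating is 100.
    """
    teams = list(win_ratio)
    # Step 1: start with each team's win ratio.
    ratings = {t: float(win_ratio[t]) for t in teams}
    for _ in range(max_iter):
        adjusted = {}
        for t in teams:
            total_games = sum(schedule[t].values())
            # Step 2: games-weighted average rating of the team's opponents.
            sos = sum(ratings[opp] * g for opp, g in schedule[t].items()) / total_games
            # Step 3: adjust the original win ratio for strength of schedule.
            adjusted[t] = win_ratio[t] * sos
        # Rescale so the arithmetic mean is 1 (assumption: keeps ratings from drifting).
        mean = sum(adjusted.values()) / len(teams)
        adjusted = {t: r / mean for t, r in adjusted.items()}
        change = max(abs(adjusted[t] - ratings[t]) for t in teams)
        ratings = adjusted
        # Step 4: repeat until the ratings stabilize.
        if change < tol:
            break
    return {t: 100.0 * r for t, r in ratings.items()}


# Hypothetical three-team league with a balanced schedule.
example_ratios = {"A": 2.0, "B": 1.0, "C": 0.5}
example_schedule = {
    "A": {"B": 10, "C": 10},
    "B": {"A": 10, "C": 10},
    "C": {"A": 10, "B": 10},
}
example_ratings = crude_team_ratings(example_ratios, example_schedule)
```

Because each pass multiplies the original win ratio by the current opponent average, a team that piled up wins against weak opponents is pulled down and a team with a hard schedule is pulled up.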

First, CTR based on actual wins and losses. In the table, “aW%” is the winning percentage equivalent implied by the CTR and “SOS” is the measure of strength of schedule--the average CTR of a team’s opponents. The rank columns provide each team’s rank in CTR and SOS:

While Washington had MLB’s best record at 98-64, they only rank fifth in CTR. aW% suggests that their 98 wins was equivalent to 95 wins against a perfectly balanced schedule, while the Yankees’ 95 wins was equivalent to 99 wins.
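The conversion between CTR and aW% can be sketched as follows, assuming (as the ratings quoted in this post suggest) that CTR is a win ratio scaled so that 100 is average. The function names are hypothetical, and the 162-game default is simply the MLB schedule length.

```python
def ctr_to_aw_pct(ctr):
    """Winning percentage implied by a CTR, assuming 100 = a win ratio of 1."""
    return ctr / (ctr + 100.0)


def ctr_to_equivalent_wins(ctr, games=162):
    """Wins against a perfectly balanced schedule implied by a CTR."""
    return games * ctr_to_aw_pct(ctr)


def wins_to_ctr(wins, games=162):
    """Inverse direction: the CTR that corresponds to a given win total."""
    losses = games - wins
    return 100.0 * wins / losses
```

Under these assumptions, an equivalent record of 95-67 corresponds to a CTR of roughly 142, and a CTR of exactly 100 corresponds to an 81-81 team.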

Rather than comment further about teams with CTRs that diverge from their records, we can just look at the average CTR by division and league. Since schedules are largely tied to division, just looking at the division CTRs explains most of the differences. A bonus is that once again they provide an opportunity to take gratuitous shots at the National League:

This is actually a worse performance for the NL than in 2011. Going back to 2009, the NL’s CTR has been 89, 93, 97, 89. The NL Central remained the worst division, dragged down by the Astros and Cubs to a dismal rating of 81, the lowest divisional rating I’ve encountered in the four years I’ve figured these ratings. This explains why the best teams in the NL Central have the lowest SOS figures. 2012 marks the first time in those four seasons that the AL East has not graded out as the top division, although its 121 CTR is higher than for some of the years in which it ranked #1.

This year it may be worth considering how this breakout would look if Houston were counted towards the ratings for the AL (West), as they will be in 2013. It helps the NL cause, naturally, but it isn’t enough to explain away the difference in league strength:

I will present the rest of the charts with limited comment. This one is based on R/RA:

This set is based on gEW% as explained in this post. Basically, gEW% uses each team’s independent distributions of runs scored and runs allowed to estimate an overall W% based on the empirical winning percentage for teams scoring x runs in a game in 2012:

The last set is based on PW%--that is, runs created and runs created allowed run through Pythagenpat:

By this measure, two of the top four teams in the game didn’t even make the playoffs, and the third was unceremoniously dumped after one game.
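For readers unfamiliar with Pythagenpat, the step that turns runs (or runs created) into an estimated W% can be sketched like this. The exponent constant `z=0.29` is an assumption; values from about .28 to .29 are in common use, and the function name is hypothetical.

```python
def pythagenpat_w_pct(runs, runs_allowed, games, z=0.29):
    """Estimated W% from runs scored and allowed using a Pythagenpat exponent.

    The exponent floats with the run environment: x = (runs per game, both teams) ** z.
    """
    runs_per_game = (runs + runs_allowed) / games
    x = runs_per_game ** z
    return runs ** x / (runs ** x + runs_allowed ** x)


def w_pct_to_win_ratio(w_pct):
    """Convert a winning percentage to the win ratio that CTR takes as input."""
    return w_pct / (1.0 - w_pct)
```

Feeding runs created and runs created allowed into `pythagenpat_w_pct`, then converting with `w_pct_to_win_ratio`, gives the kind of estimated win ratio used as the starting point for the PW%-based ratings.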

I will now conclude this piece by digressing into some theoretical discussion regarding averaging these ratings. CTR returns a result which is expressed as an estimated win ratio, which as I have explained is advantageous because these ratios are Log5-ready, which makes them easy to work with during and after the calculation of the ratings. However, the nature of win ratios makes anything based on arithmetic averages (including the average division and league ratings reported above) non-kosher mathematically.
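To illustrate what "Log5-ready" means in practice, here is a minimal sketch showing that with win ratios the Log5 matchup estimate collapses to a one-line expression. The function names are hypothetical; the Log5 formula itself is the standard one.

```python
import math


def win_prob_from_ratios(ratio_a, ratio_b):
    """Probability that team A beats team B, given their win ratios."""
    return ratio_a / (ratio_a + ratio_b)


def log5(w_pct_a, w_pct_b):
    """The same estimate written in the usual winning-percentage form."""
    a_wins = w_pct_a * (1.0 - w_pct_b)
    b_wins = w_pct_b * (1.0 - w_pct_a)
    return a_wins / (a_wins + b_wins)


def ratio_to_w_pct(ratio):
    """Convert a win ratio to a winning percentage."""
    return ratio / (1.0 + ratio)


# A 1.5 win ratio (.600) team against a 0.8 win ratio (.444) team.
p_ratio = win_prob_from_ratios(1.5, 0.8)
p_log5 = log5(ratio_to_w_pct(1.5), ratio_to_w_pct(0.8))
same_answer = math.isclose(p_ratio, p_log5)
```

The two forms agree because a winning percentage divided by its complement is the win ratio, so the ratio form is just Log5 with the common factors cancelled.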

These distortions are more apparent in the higher standard deviation of W% world (whether due to the nature of the sport or the sample size) of the NFL, so let me use that league as an example. A 15-1 team and a 1-15 team obviously average to 8-8, which can be seen by averaging their wins, losses, or winning percentages. However, their respective win ratios of 15 and .07 average to 7.53.

Since the win ratios are intended to be used multiplicatively, the correct way to average in this case is to use the geometric average (*). For the NFL example above, the geometric average of the win ratios is in fact 1.
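The NFL example can be checked directly. This short sketch uses the standard library's `statistics.geometric_mean` (Python 3.8 or later) alongside a by-hand version that takes the product and then the n-th root; the variable names are just for illustration.

```python
import math
from statistics import geometric_mean

# Win ratios for a 15-1 team and a 1-15 team.
win_ratios = [15 / 1, 1 / 15]

# The arithmetic average badly overstates the pair.
arithmetic = sum(win_ratios) / len(win_ratios)

# The geometric average, from the library and by hand (product, then n-th root).
geometric = geometric_mean(win_ratios)
geometric_by_hand = math.prod(win_ratios) ** (1 / len(win_ratios))

print(f"arithmetic: {arithmetic:.2f}")  # arithmetic: 7.53
print(f"geometric:  {geometric:.2f}")   # geometric:  1.00
```

A geometric average of 1 is a win ratio of 1, which is the 8-8 record one would expect from combining the two teams.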

So here are the divisional ratings for actual wins based CTR using the geometric average rather than the arithmetic average:

The fact that all of the ratings decline is not a surprise; it is a certainty. For any set of positive values, the geometric average is less than or equal to the arithmetic average, and the two are equal only when every value is the same. There really is no reason to use the arithmetic average other than laziness, which I have always found to be an unacceptable excuse when committing a clear sabermetric or mathematical faux pas (as opposed to making simplifying assumptions, working with a less-than-perfect model, or focusing on easily available and standardized data), and so going forward any CTR averages I report will be geometric means rather than arithmetic.

(*) The GEOMEAN function in Excel can figure this for you, but it’s really pretty simple. If you have n values and you want to find the geometric mean, take the product of all of these, then take the n-th root. The geometric average of 3, 4, and 5 is the cube root of (3*4*5), or about 3.91. The Bill James Power/Speed number, 2*HR*SB/(HR + SB), is a close cousin: it is the harmonic average of home runs and stolen bases, although I don't think he realized it at the time it was introduced.