Walk Like a Sabermetrician: Crude Team Ratings 2010

In a previous post I explained the methodology behind these rankings and acknowledged a fair number of their shortcomings. I will not harp on either of those topics again here, but that is simply to avoid being repetitive--these rankings are far from perfect and should be taken in that spirit.

I will present four different sets of rankings based on four different inputs. The manner of calculating the rankings is identical for all four; the only difference is which initial win ratio is used. I tend to believe the final set, based on Runs Created/Runs Created Allowed, is the most indicative of true talent, but any season aggregate metric is going to have obvious deficiencies when it comes to estimating true talent, and in reality some combination would probably be superior for that purpose.

The first ranking is what I call actual CTR, since it is based on the actual W/L ratio of each team. This is an attempt to evaluate teams based on their actual game outcomes, just adjusted for strength of schedule. Some might argue that even if actual record is used for the team, some component-estimated record should be used to gauge SOS, and they have a point, but this approach equates team strength to W/L ratio.

In the chart below, aW% is adjusted W%, and "s rk" is the rank of the team's SOS (#1 = toughest schedule):

You can see from the chart that a 100 CTR does not correspond to a .500 aW%. This is because the rankings are designed to give the average team a 100 CTR; since the properties of W/L ratio ensure that the average W/L ratio will be > 1, an average W/L ratio is not the same thing as the W/L ratio of an average team (which is 1, of course). If this scale distortion bothers you, use aW%--and stop using ERA+, because the distortion is similar. I am unconcerned about the issue because I want the average rating (but not necessarily the median) to be 100.
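The scale effect described above can be checked with a few lines of Python. This is only a sketch: the four-team league and its records are hypothetical, and rescaling the ratios so that their mean is 100 is my assumption about how the "average rating is 100" property would be enforced, not a reproduction of the actual CTR spreadsheet.

```python
# Hypothetical four-team league on a 162-game schedule (illustrative records,
# not actual 2010 standings). League wins equal league losses.
records = {"A": (100, 62), "B": (90, 72), "C": (72, 90), "D": (62, 100)}

# W/L ratio for each team, and the simple average of those ratios.
ratios = {team: w / l for team, (w, l) in records.items()}
mean_ratio = sum(ratios.values()) / len(ratios)

# The league as a whole is exactly .500, so its aggregate W/L ratio is 1.
league_wins = sum(w for w, _ in records.values())
league_losses = sum(l for _, l in records.values())
league_ratio = league_wins / league_losses

# Assumed rescaling: divide by the mean ratio so the average rating is 100.
ratings = {team: 100 * r / mean_ratio for team, r in ratios.items()}

# Rating that a true .500 team (W/L ratio of 1) would get on this scale.
rating_of_500_team = 100 / mean_ratio

print(f"league W/L ratio:        {league_ratio:.3f}")
print(f"mean of team W/L ratios: {mean_ratio:.3f}")
print(f"rating of a .500 team:   {rating_of_500_team:.1f}")
```

The mean of the team ratios comes out above 1 even though the league is .500, so a .500 team lands below 100 once the ratings are scaled to average 100.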

The three toughest schedules belong to the bottom three teams in the AL East (BAL, TOR, BOS, although it's hardly fair to refer to the latter two teams as being at the bottom of anything). Of course, this is a consequence of playing in the stronger league and having to play a bunch of games against the two highest-rated teams. What I don't like about this definition of SOS is that one can make the case that Tampa Bay's schedule was just as tough on paper as Toronto's (assuming an equal distribution of games against non-AL East opponents)--but Tampa Bay's success in winning games makes Toronto's harder in practice. It would be difficult to devise a SOS technique that took that point of view, however.

The more notable weakness of the schedule adjustment (and thus of the ratings themselves) is that it makes no correction for the influence that a team has upon the win-loss record of its opponents, and thus might distort the results at the extremes.

I also have some division/league ratings; these are simply the average CTR of all of the teams in the division:

Four of the six divisions were relatively equal, with an average CTR in the 96-102 range. However, one extremely good division and one extremely bad division result in the AL having a 108 ranking to the NL's 93. Those overall league rankings imply that the average AL team would be expected to have a .537 W% against the average NL team. The AL's actual 2010 interleague W% was .532, which of course was not generated through a balanced schedule of neutral-field meetings and covers a sample of only 252 games.
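The .537 figure is consistent with treating the two league ratings as win ratios and taking 108/(108 + 93). That conversion is an assumption on my part here rather than something spelled out in this post, and the function names below are made up for illustration. The sketch also adds a simple binomial standard error to show how loose a 252-game sample is.

```python
import math

def expected_wpct(rating_a, rating_b):
    """Expected W% of A against B, treating ratings as win ratios (assumed form)."""
    return rating_a / (rating_a + rating_b)

def wpct_standard_error(p, games):
    """Binomial standard error of an observed W% over `games` independent games."""
    return math.sqrt(p * (1 - p) / games)

al_rating, nl_rating = 108, 93      # league averages quoted above
interleague_games = 252             # 2010 interleague sample quoted above

est = expected_wpct(al_rating, nl_rating)
se = wpct_standard_error(est, interleague_games)

print(f"expected AL W% vs. NL:      {est:.3f}")   # .537
print(f"standard error (252 games): {se:.3f}")    # about .031
```

Under these assumptions the observed .532 sits well within one standard error of the estimate, so the two figures cannot really be told apart with a sample this small.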

Although I only developed the rankings this year, I ran 2009 through the spreadsheet, and the AL/NL disparity was estimated to be greater (113/89, .559). The AL West was the top-rated division (126), and the NL Central fared a tick worse than it would in 2010 (80).

The weak NL Central allowed the Reds to have the worst CTR of any playoff team (108) in 2010, although that figure was also better than the Cardinals' 2009 low of 103.

Switching gears, here are the gCTR figures. These are based on gEW% (described in this post), which takes into account the distributions of team runs scored and runs allowed per game (but treats each distribution independently of the other):

I'm not going to have a lot to say about the charts for each input, since they track the differences between their inputs and actual W%, which I have already written about in one form or another. Next is eCTR, which uses standard Pythagenpat W-L record as its starting point:

Finally, pCTR, which is based on Runs Created and Runs Created Allowed fed into Pythagenpat:

Here the AL East really looks good, with the top four teams in baseball. The potential for this type of clustering of strong teams (and, in the case of the NL Central, lousy teams) in one division is one of the reasons I oppose treating winners of small divisions playing with unbalanced schedules as sacrosanct.
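For readers who have not seen it, the Pythagenpat estimate that underlies both eCTR and pCTR can be sketched as follows. The exponent constant `z` is usually quoted somewhere around .28 to .29; the .29 default below is an assumption, since this post does not state which value was used, and the team totals in the usage lines are hypothetical.

```python
def pythagenpat_wpct(runs, runs_allowed, games, z=0.29):
    """Pythagenpat W% estimate.

    The exponent floats with the run environment:
        x = ((runs + runs_allowed) / games) ** z
    z = 0.29 is an assumed default, not a value taken from this post.
    """
    rpg = (runs + runs_allowed) / games   # runs per game, both teams combined
    x = rpg ** z
    return runs ** x / (runs ** x + runs_allowed ** x)

# Hypothetical team: 800 runs scored, 700 allowed over 162 games.
ew = pythagenpat_wpct(800, 700, 162)
print(f"estimated W%: {ew:.3f}")
print(f"estimated win ratio: {ew / (1 - ew):.3f}")
```

For eCTR the inputs would be actual runs scored and allowed; for pCTR, Runs Created and Runs Created Allowed would be passed in their place.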

pCTR and the associated aW% are the closest in construction to other popular ratings that account for strength of schedule and use component rather than actual W-L inputs, namely the third-order records published by Baseball Prospectus and the TPI rankings figured by Justin at Beyond the Box Score. Here are the most comparable winning percentages for each methodology--my aW% based on pCTR, the third-order winning percentage from BP, and the TPI from Justin:

You can see that there is general agreement between all of the methods, which is a good sign. The best correlation is between CTR and BP (+.983); the worst between CTR and Justin (+.942), with the BP/Justin correlation falling in the middle (+.965). The methods are all in general agreement about the proper spread of the teams--the standard deviation of the BP and Justin figures is .059 compared to .060 for the CTR-based figures, .060 for my estimate of PW% without any schedule adjustments, and .068 for actual W%.
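The comparison above boils down to two statistics per pair of systems: a correlation between the winning percentages and the standard deviation of each set. A minimal sketch is below. The two short lists are hypothetical placeholder values, not the actual CTR, BP, or TPI figures, and the use of the population (rather than sample) standard deviation is an assumption.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

def population_sd(xs):
    """Population standard deviation (divides by n, an assumed convention)."""
    mean = sum(xs) / len(xs)
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))

# Hypothetical placeholder W% values for five teams under two systems.
system_a = [0.610, 0.560, 0.500, 0.450, 0.380]
system_b = [0.595, 0.570, 0.490, 0.460, 0.385]

r = pearson(system_a, system_b)
print(f"correlation: {r:+.3f}")
print(f"SD of system A: {population_sd(system_a):.3f}")
print(f"SD of system B: {population_sd(system_b):.3f}")
```

With the full 30-team columns from each system substituted for the placeholder lists, the same two functions would produce the kind of figures quoted above.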

BP's approach to generating the underlying W% estimate is essentially the same as mine, except for the use of different run estimators (BP uses EqR and I use Base Runs). Justin’s approach is a little different, and I’m personally not wild about it--it breaks defense down into pitching and fielding, making use of FIP and defensive metrics like UZR and Dewan’s Runs Saved. The approach used by BP and myself looks at the actual total component statistics surrendered by the defense. This does not allow one to split defense into pitching and fielding, but it also makes use of the actual observed interaction between the two on the field rather than using estimates that might make sense in isolation but leave something missing when the two are considered as one unit.

In any event, it is encouraging to see that CTR is able to produce similar results to systems developed by others that have been around a little or a lot longer as the case may be. If CTR returned very different results, I would probably conclude that it had a serious methodological error rather than the minor though not insignificant flaws that I am already aware of.