Thursday, January 31, 2013

MLB Has Lost Control of the Game

Many years ago, the powers that be in baseball sat down and drew up a set of rules to govern the game. These rules applied to the game on the field in an attempt to ensure that the game was contested in a sporting manner. However, an exclusive analysis performed by Walk Like a Sabermetrician regarding the 2012 MLB season reveals that these rules are routinely violated despite the penalties against them. Some may scoff at this analysis and say that some of these rules have not been broken intentionally. To do so is to coddle the rule breakers. These rules should have shaped the training of players and molded their behavior; instead, the players have failed to conform their physical actions to the sacred code of the game. Baseball is out of control, and it is out in the open for all to see, yet the powers that be have not taken any substantive steps to beef up enforcement. The shocking details:

* Pitchers have long been tasked with the simple job of providing a fair pitch to the batter, one that is within a zone deemed to be conducive to putting the ball in play. Pitchers have pushed the envelope, though, attempting to throw pitches that violate the spirit but not the letter of the rule. During the 2012 regular season, there were 14,709 separate occasions on which a pitcher failed to provide a hittable pitch and was penalized with a walk.

However, two additional details illustrate just how bad this problem has become. The first is that every single pitcher who logged a non-negligible number of innings issued at least one walk. The disregard for this rule and the attempt to deceive batters has infected the entire population of major league pitchers.

The second is that no fewer than 1,055 times did a pitcher intentionally violate this rule and make no attempt to provide a hittable pitch to the batter. This almost always occurred with expressed consent and even on direct order of the manager. Blatantly thumbing their nose at the code of the game, these pitchers and managers engaged in unsporting activity. The penalties simply must be increased to stamp out this behavior.

* Even more shockingly, there were 1,494 instances of a pitcher hitting a batter with a pitch. This act is expressly prohibited by the rules of baseball and the deleterious nature of this action is not limited to simply breaking rules. Hit batters have been linked to numerous cases of injury and even death. In choosing to play baseball, players should not be forced to make any decisions that could have an impact on their health, but batters risk extreme injury every game as they are forced to bat against these recalcitrant pitchers.

It has also become apparent that these violent acts are sometimes intentionally committed, often in a bizarre meld of revenge and tribal grudges that have more in common with gang warfare than gentlemanly sport. MLB has left the penalties for hit batters so toothless that these events continue, risking the health of batters and setting an awful example for the children of America.

* Any excuses regarding physical rather than moral failings go out the window when it comes to the matter of ejections. Umpires are given the power to remove disrespectful and violent offenders from the game. Such an awesome power should never have to be used in a civil game, but MLB’s product is anything but civil. 179 times an umpire had no choice but to remove a participant from the game for bad behavior. Again, the titular authority figures known as managers were frequently involved in these violations.

It is a matter of simple common sense that when rules are violated, it means that the associated penalties are insufficiently strong. This simple truth has been illustrated time and time again throughout human society. Any time draconian penalties are instituted, the associated behavior ceases. Examples include the lack of murders and non-existence of drug use in America, the strict adherence to all bylaws of the NCAA, and, of course, the complete lack of PED use in Olympic sport. MLB needs to learn from these examples and curb the culture of rule-breaking that prospers on the field.

Monday, January 21, 2013


* I don’t have anything of substance to add on the deaths of Earl Weaver and Stan Musial, but since both were favorites of mine I feel compelled to write a little something. Weaver of course is a managerial hero to many in the sabermetric community. He predated my time as a baseball fan by a significant number of years, but he still was an influence on me through the writing of Bill James and Thomas Boswell, his own book Weaver on Strategy, and Earl Weaver Baseball, which in its DOS form was the first baseball game I played (even though I’m sure he wasn’t writing the code).

The paragraph that follows is the kind of unsupported-by-evidence blurb I try to avoid writing, because in many cases you can get destroyed with a little cursory fact-checking. Weaver is famed for utilizing his roster to its fullest, particularly his bench -- finding specialists whose strengths could help a club. The value of the bench has been greatly reduced in today’s game thanks to the roster crunch--the extra spots have gone to pitchers. Some of this is a natural result of the never-ending progression towards lighter pitching workloads, but some of it may be traceable to an attempt to counter the Weaver school. Once the bench was smartly utilized with specialists, it was necessary to have a counter-stocked bullpen. The verdict of the powers that be in baseball has been to prioritize stopping the other manager from gaining an offensive advantage through substitution rather than leaving one’s self with the tools to do so. One could argue that Tony LaRussa is the anti-Weaver in this regard.

Stan Musial was of course a great player, one who has gotten the short end of the stick--among his contemporaries on either bordering generation, there has been more celebration reserved for DiMaggio, Williams, Mantle, Mays, and Aaron among outfielders. He’s always been a favorite of mine, though--in fact, I have two framed baseball pictures hanging on my wall (the selection is more due to happenstance than a design to pick these particular two, but I wouldn’t hang a picture on my wall if I didn’t like the subject). One is of Babe Ruth’s sixtieth home run, and one is a picture of Stan Musial and various related Musial memorabilia (bats, uniform, ball, etc.).

* The various sports stories of last week (thankfully, not baseball-related) were a perfect reminder of why I have so little respect for sportswriters as a class (certainly I judge people as individuals, but it so happens that I have little use for the majority of individual sportswriters).

The best attribute of sportswriters is how ignorant they tend to be. When someone writes invective against sabermetrics, or displays a complete lack of understanding of statistics or economics or probability, it is easy to simply laugh them off. A huge number of sportswriters fall into this toss category.

As an aside, if I didn’t come to the table with a pre-conceived dim view of the world view held by most mainstream journalists (non-sports), it would be difficult for me to believe how ignorant they are. When I read news stories about topics on which I am well-informed, it is rare to go through an article that does not contain an outright falsehood, a statement of surprise at something that is blindingly obvious, or a quote from a clearly biased source that is allowed to pass without noting that bias. And when I see this occur in articles about a topic about which I know more than the journalist, it naturally gives me great pause about what I read about topics on which I am seeking to learn more.

Unlike many people inclined to interest in sabermetrics, I am not at all looking forward to the rapidly approaching day in which any aspiring young mainstream baseball scribe will be fluent in sabermetrics and not prone to dismissing non-traditional viewpoints. While this will have a limited positive effect of reducing the amount of idiocy we are all exposed to, it will make it that much harder to simply ignore a writer with cause.

My biggest problem with sportswriters is not that they are ignorant--it's that they are self-righteous, prone to pop psychology, and often downright nasty to their subjects. The new breed of baseball writer will still display all these traits, but without the casual ignorance of logic when it comes to strategy and evaluation of players. The perfect symbol of this new breed is Jeff Passan. Passan is as smarmy and as prone to being a jackass as your garden variety Murray Chass-era hack. But because Passan incorporates sabermetric statistics and thinking as appropriate, he is much more likely to get a pass for being a jerk than is a sabermetric ignoramus.

* The Armstrong and Te’o stories are also worthwhile as an illustration of how different my interest in sports is from the interest of the fictional public to which the stories are written. I have essentially zero interest in the private lives of athletes. I do not pick which athletes to root for because they seem like nice people, or because they have overcome some tragedy in their personal lives--I pick which athletes to root for because they play(ed) for/support the teams that I do, or because I enjoy watching them play the game.

When sportswriters single out a human interest story, it is their way of telling you who to root for. One could look at the roster of any Division I college football team with its 85 players and find someone who has been through a traumatic experience. Frankly, the notion that the death of a college-aged person’s grandmother would be a trauma worthy of making into a story is laughable. Certainly such an event is a terrible thing for the affected person and family, but it is also a fact of life and something that the majority of people in that age group have experienced. But journalists decided that Te’o was special, and that you should root for him--and they almost gave him an absurd Heisman trophy for it.

More broadly, I don’t care if Player X is a jerk to fans (and I certainly don’t care if he is a jerk to the media). I might care if I had any reason to interact with Player X--but I don’t, and the odds that I will ever interact with him are infinitesimal. Sportswriters are often incapable of realizing that the rest of us aren’t affected or interested by whatever inconveniences or petty issues Player X creates for them.

If I discover that Player X isn’t too friendly to fans who come up to him and ask for his autograph, I feel no reason to change my opinion of him. I know that when people who want something that I have approach me, I do my best to ignore them altogether (obviously no one is asking me for my autograph, but we all deal with panhandlers, charities calling for money, family members who need a favor, and the like). How can I fault Player X for behaving exactly as I would behave? Why would I factor this into the degree to which I like Player X, when the only reason I am even aware of his existence rather than that of another 1/7,000,000,000 of the world’s population is his ability to play baseball?

Of course, this can be easily looped back to the big baseball story of the month, the Hall of Fame voting. While the Hall of Fame was broken beyond repair before steroids made it a complete joke, the principle holds: I want to view baseball players as baseball players, nothing more and nothing less. Whatever else they are outside of that is of equal consequence to me as that of a mailman in Topeka.

Tuesday, January 15, 2013

Crude Team Ratings, 2012

For the last few years I have published a set of team ratings that I call “Crude Team Ratings”. The name was chosen to reflect the nature of the ratings--they have a number of limitations, of which I documented several when I introduced the methodology. Crude or not, this was a banner year in sabermetrics in which to be a purveyor of a team rating system, and I wouldn’t want to miss out on the fun.

The silliness of the Fangraphs power rankings and the eventual decision to modify them (while shifting blame to defensive shifts for odd results rather than the logic of the method) offered an opportunity to consider the nature of such systems. When you think about it, team ratings actually may be the most controversial and important objective methods used in sports analysis. As sabermetricians it is easy to overlook this because they don’t play a large role in baseball analysis. But for college sports, rating systems are not just a way to draw up lists of teams--they help determine which teams are invited to compete for the national championship. And while most teams with a chance to win the championship in sports with large tournaments are comfortably in the field by any measure, in college football ranking systems are asked to make distinctions between multiple teams that would be capable of winning the title if permitted to compete for it.

There are any number of possible ways to define a team rating system, but to simplify things I will propose two broad questions which should be asked before such a system is devised:

1. Do you wish to rank teams based on their bottom line results (wins and losses), or include other distinguishing factors (underlying performance, generally in terms of runs/runs allowed or predictors thereof)?

I would contend that if you are using team ratings to filter championship contenders, it is inappropriate to consider the nature of wins and losses; only the binary outcomes should count. If you are attempting to predict how teams will perform in the future, then you’d be a fool not to consider other factors.

2. Do you wish to incorporate information about the strength of the team at any given moment in time, or do you wish to evaluate the team on its entire body of work?

I would contend that for use as a championship filter, the entire body of work should be considered, with no adjustments made for injuries, trades, performance by calendar, etc. If you are using ratings to place bets, then ignoring these factors means that you must consider sports books to be the worthiest of charities.

Obviously my two questions and accompanying answers are painted in broad strokes. But defining what you are setting out to measure in excessively broad strokes is always preferable to charging ahead with no underlying logic and no attempt to justify (or even define) that logic. Regardless of how big your website is, how advanced your metrics are, how widely used your underlying metric is for other purposes, how much self-assuredness you make your pronouncements with, or who is writing the blurbs for each team, if you don’t address basic questions of this nature, your final product is going to be an embarrassing mess. Fangraphs learned that the hard way.

For the two basic questions above, CTR offers some flexibility on the first question. It can only use team win ratio as an input, but that win ratio can be estimated. In this post I’ll present four variations--based on actual wins and losses, based on runs/runs allowed, based on game distribution-adjusted runs/runs allowed, and based on runs created/runs created allowed. You could think up other inputs or any number of permutations thereof (such as actual wins/losses regressed 25% or a weighted average of actual and Pythagorean record, etc.). On the second question, CTR has no real choice but to use the team’s entire body of work.

I explained how CTR is figured in the post linked at the top of this article, but in short:

1) Start with a win ratio figure for each team. It could be actual win ratio, or an estimated win ratio.

2) Figure the average win ratio of the team’s opponents.

3) Adjust for strength of schedule, resulting in a new set of ratings.

4) Begin the process again. Repeat until the ratings stabilize.
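
The four steps above can be sketched roughly in Python. This is only an illustration, not the exact CTR calculation: the input structures (`win_ratio`, `opponents`), the rule that the adjusted rating is the win ratio times the average opponent rating, and the rescaling to a league average of 100 are all assumptions made for the example.

```python
def crude_team_ratings(win_ratio, opponents, iterations=100, tol=1e-9):
    """Iteratively adjust win ratios for strength of schedule.

    win_ratio: dict of team -> win ratio (W/L, actual or estimated)
    opponents: dict of team -> list of opponents, one entry per game
    Both inputs are hypothetical structures chosen for this sketch.
    """
    teams = list(win_ratio)
    # Step 1: start from the raw win ratios, scaled so the average is 1.
    avg = sum(win_ratio.values()) / len(teams)
    rating = {t: win_ratio[t] / avg for t in teams}
    for _ in range(iterations):
        # Step 2: average rating of each team's opponents (SOS).
        sos = {t: sum(rating[o] for o in opponents[t]) / len(opponents[t])
               for t in teams}
        # Step 3: adjust the win ratio for schedule strength, then rescale.
        new = {t: win_ratio[t] * sos[t] for t in teams}
        avg = sum(new.values()) / len(teams)
        new = {t: v / avg for t, v in new.items()}
        # Step 4: repeat until the ratings stabilize.
        change = max(abs(new[t] - rating[t]) for t in teams)
        rating = new
        if change < tol:
            break
    # Report on a scale where the average team is 100 (assumed convention).
    return {t: 100 * r for t, r in rating.items()}


# Made-up four-team league in which everyone plays everyone else once.
wr = {"A": 2.0, "B": 1.0, "C": 1.0, "D": 0.5}
opp = {t: [o for o in wr if o != t] for t in wr}
ratings = crude_team_ratings(wr, opp)
```

Even with a balanced round robin the ratings are not simply proportional to the raw win ratios, because the best team never has to play itself and so faces the weakest slate.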

First, CTR based on actual wins and losses. In the table, “aW%” is the winning percentage equivalent implied by the CTR and “SOS” is the measure of strength of schedule--the average CTR of a team’s opponents. The rank columns provide each team’s rank in CTR and SOS:

While Washington had MLB’s best record at 98-64, they only rank fifth in CTR. aW% suggests that their 98 wins was equivalent to 95 wins against a perfectly balanced schedule, while the Yankees’ 95 wins was equivalent to 99 wins.
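
As a rough illustration of how a rating translates to aW%, the sketch below treats ratings as win ratios on a scale where the average team is 100 (an assumption consistent with the figures quoted in this post) and applies the odds form of Log5. The function names are made up for the example.

```python
def log5_win_prob(ctr_a, ctr_b):
    """Probability that team A beats team B, treating ratings as win ratios."""
    return ctr_a / (ctr_a + ctr_b)


def adjusted_w_pct(ctr, league_average=100.0):
    """W% equivalent against an average opponent (assumed to be rated 100)."""
    return log5_win_prob(ctr, league_average)


def ctr_from_w_pct(w_pct, league_average=100.0):
    """Inverse: the rating implied by a W% against an average opponent."""
    return league_average * w_pct / (1.0 - w_pct)


# A team whose aW% works out to 99-63 over 162 games
ctr = ctr_from_w_pct(99 / 162)
print(round(ctr))  # 157
```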

Rather than comment further about teams with CTRs that diverge from their records, we can just look at the average CTR by division and league. Since schedules are largely tied to division, just looking at the division CTRs explains most of the differences. A bonus is that once again they provide an opportunity to take gratuitous shots at the National League:

This is actually a worse performance for the NL than in 2011. Going back to 2009, the NL’s CTR has been 89, 93, 97, 89. The NL Central remained the worst division, dragged down by the Astros and Cubs to a dismal rating of 81, the lowest divisional rating I’ve encountered in the four years I’ve figured these ratings. This explains why the best teams in the NL Central have the lowest SOS figures. 2012 marks the first time that the AL East has not graded out as the top division in those four seasons, although its 121 CTR is higher than for some of the years in which it ranked #1.

This year it may be worth considering how this breakout would look if Houston was counted towards the ratings for the AL (West) as they will be in 2013. It helps the NL cause, naturally, but it isn’t enough to explain away the difference in league strength:

I will present the rest of the charts with limited comment. This one is based on R/RA:

This set is based on gEW% as explained in this post. Basically, gEW% uses each team’s independent distributions of runs scored and runs allowed to estimate an overall W% based on the empirical winning percentage for teams scoring x runs in a game in 2012:

The last set is based on PW%--that is, runs created and runs created allowed run through Pythagenpat:

By this measure, two of the top four teams in the game didn’t even make the playoffs, and the third was unceremoniously dumped after one game.

I will now conclude this piece by digressing into some theoretical discussion regarding averaging these ratings. CTR returns a result expressed as an estimated win ratio, which as I have explained is advantageous because these ratios are Log5-ready and thus easy to work with during and after the calculation of the ratings. However, the nature of win ratios makes anything based on arithmetic averages (including the average division and league ratings reported above) non-kosher mathematically.

These distortions are more apparent in the higher standard deviation of W% world (whether due to the nature of the sports or the sample size) of the NFL, so let me use those as an example. A 15-1 team and a 1-15 team obviously average to 8-8, which can be seen by averaging their wins, losses, or winning percentages. However, their respective win ratios of 15 and .07 average to 7.53.

Since the win ratios are intended to be used multiplicatively, the correct way to average in this case is to use the geometric average (*). For the NFL example above, the geometric average of the win ratios is in fact 1.
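
The NFL example can be checked with a few lines of Python. This is just a small sketch; the helper names are made up for the illustration.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    # n-th root of the product, computed through logs for numerical stability
    return math.exp(sum(math.log(v) for v in values) / len(values))


win_ratios = [15 / 1, 1 / 15]  # a 15-1 team and a 1-15 team
print(round(arithmetic_mean(win_ratios), 2))  # 7.53
print(round(geometric_mean(win_ratios), 2))   # 1.0
```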

So here are the divisional ratings for actual wins based CTR using the geometric average rather than the arithmetic average:

The fact that all of the ratings decline is not a surprise; it is a certainty. By definition the geometric average is less than or equal to the arithmetic average. There really is no reason to use the arithmetic average other than laziness, which I have always found to be an unacceptable excuse when committing a clear sabermetric or mathematical faux pas (as opposed to making simplifying assumptions, working with a less-than-perfect model, or focusing on easily available and standardized data), and so going forward any CTR averages I report will be geometric means rather than arithmetic.

(*) The GEOMEAN function in Excel can figure this for you, but it’s really pretty simple. If you have n values and you want to find the geometric mean, take the product of all of these, then take the n-th root. The geometric average of 3, 4, and 5 is the cube root of (3*4*5). The Bill James Power/Speed number is a related kind of average, the harmonic mean of home runs and stolen bases, although I don't think he realized it at the time it was introduced.

Tuesday, January 08, 2013

Run Distribution and W%, 2012

A couple of caveats apply to everything that follows in this post. The first is that there are no park adjustments anywhere. There's obviously a difference between scoring 5 runs at Petco and scoring 5 runs at Coors, but if you're using discrete data there's not much that can be done about it unless you want to use a different distribution for every possible context. Similarly, it's necessary to acknowledge that games do not always consist of nine innings; again, it's tough to do anything about this while maintaining your sanity.

All of the conversions of runs to wins are based only on 2012 data. Ideally, I would use an appropriate distribution for runs per game based on average R/G, but I've taken the lazy way out and used the empirical data for 2012 only.

The first breakout is record in blowouts versus non-blowouts. I define a blowout as a margin of five or more runs. This is not really a satisfactory definition of a blowout, as many five-run games are quite competitive--"blowout” is just a convenient label to use, and expresses the point succinctly. I use these two categories with wide ranges rather than more narrow groupings like one-run games because the frequency and results of one-run games are highly biased by the home field advantage. Drawing the focus back a little allows us to identify close games and not so close games with a margin built in to allow a greater chance of capturing the true nature of the game in question rather than a disguised situational effect.
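
A minimal sketch of the blowout split, assuming a team's season is available as a list of (runs scored, runs allowed) pairs. That input format and the function names are hypothetical; the five-run threshold is the definition given above.

```python
def split_record(games, blowout_margin=5):
    """Split a team's record into blowouts and non-blowouts.

    games: list of (runs_scored, runs_allowed) tuples for one team.
    Returns [wins, losses] for each category.
    """
    record = {"blowout": [0, 0], "non_blowout": [0, 0]}
    for rs, ra in games:
        key = "blowout" if abs(rs - ra) >= blowout_margin else "non_blowout"
        record[key][0 if rs > ra else 1] += 1
    return record


def w_pct(wins, losses):
    return wins / (wins + losses) if wins + losses else float("nan")


# Made-up results for illustration only
sample = [(7, 1), (3, 2), (2, 8), (4, 5), (10, 5)]
rec = split_record(sample)
diff = w_pct(*rec["blowout"]) - w_pct(*rec["non_blowout"])
```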

In 2012, 74.9% of games were non-blowouts (and thus 25.1% were blowouts). Here are the teams sorted by non-blowout record:

Records in blowouts:

This chart is sorted by differential between blowout and non-blowout W% and also displays blowout/non-blowout percentage:

As you can see, the Phillies had the highest percentage of non-blowouts (and also went exactly .500 in both categories) while the Angels had the highest percentage of blowouts. This is the second consecutive season in which Cleveland has had the most extreme W% differential (in either direction). Coincidentally, both pennant winners were better in non-blowouts by the same -.012.

A more interesting way to consider game-level results is to look at how teams perform when scoring or allowing a given number of runs. For the majors as a whole, here are the counts of games in which teams scored X runs:

The “marg” column shows the marginal W% for each additional run scored. In 2012, the fourth run was both the most marginally valuable and the cutoff point between winning and losing (on average).

I use these figures to calculate a measure I call game Offensive W% (or Defensive W% as the case may be), which was suggested by Bill James in an old Abstract. It is a crude way to use each team’s actual runs per game distribution to estimate what their W% should have been by using the overall empirical W% by runs scored for the majors in the particular season.
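
In rough terms, the calculation looks like the sketch below. The data structures and function names are assumptions for illustration, and the 11+ bucket mirrors the grouping used for the gEW% family in this post; gDW% would follow the same pattern from the runs allowed side.

```python
def empirical_w_pct_by_runs(league_games, cap=11):
    """league_games: list of (runs_scored, won) pairs, one per team-game.
    Games with `cap` or more runs share a single bucket."""
    wins, games = {}, {}
    for runs, won in league_games:
        k = min(runs, cap)
        games[k] = games.get(k, 0) + 1
        wins[k] = wins.get(k, 0) + (1 if won else 0)
    return {k: wins[k] / games[k] for k in games}


def marginal_w_pct(w_by_runs):
    """Change in W% from scoring one more run (the 'marg' idea)."""
    return {k: w_by_runs[k] - w_by_runs[k - 1]
            for k in w_by_runs if k - 1 in w_by_runs}


def game_offensive_w_pct(team_runs, w_by_runs, cap=11):
    """Average the league W% at each of the team's actual game run totals."""
    return sum(w_by_runs[min(r, cap)] for r in team_runs) / len(team_runs)


# Made-up league W% table and team game log, for illustration only
w_table = {0: 0.0, 1: 0.1, 2: 0.25, 3: 0.4}
gow = game_offensive_w_pct([0, 3, 3, 2], w_table)
marg = marginal_w_pct(w_table)
```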

Theoretical run per game distribution was a major topic on this blog in 2012, and so I will digress for a moment and talk about what I found. The major takeaway is that a zero-modified negative binomial distribution provides a pretty good model of runs per game (I called my specific implementation of that model Enby so that I didn’t have to write “zero-modified negative binomial” a hundred times, but that’s what it is. This is important to point out 1) so that I don’t give the impression that I created a unique distribution out of thin air and 2) to assure you that said distribution is a real thing that you could read about in a textbook).

However, the Enby distribution is not ready to be used to estimate winning percentages. In order to use Enby, you have to estimate the three parameters of the negative binomial distribution at a given R/G mean. I do this by estimating the variance of runs scored and fudging (there is no direct way to solve for these parameters, at least that is published in math journals that I can make heads or tails of). The estimate of variance is quite crude, although it appears to work fine for modeling the run distribution of a team independently. But as Tango Tiger has shown in his work with the Tango Distribution (which considers the runs per inning distribution), the distribution must be modified when two teams are involved (as is the case when considering W%, as it simultaneously involves the runs scored and allowed distribution). I have not yet been able to apply a similar corrector in Enby, although I have an idea of how to do so which is on my to-do list. Perhaps by the time I look at the 2013 data, I’ll have a theoretical distribution to use. Here are three reasons why theoretical would be superior to empirical for this application:

1. The empirical distribution is subject to sample size fluctuations. In 2012, teams that scored 11 runs won 96.9% of the time while teams that scored 12 runs won 95.9% of the time. Does that mean that scoring 11 runs is preferable to scoring 12 runs? Of course not--it's a small sample size fluke (there were 65 games in which 11 runs were scored and 49 games in which 12 runs were scored). Additionally, the marginal values don’t necessarily make sense even when W% increases from one runs scored level to another--for instance, the marginal value of a ninth run is implied to be .030 wins while the marginal value of a tenth run is implied to be .063 wins. (In figuring the gEW% family of measures below, I lumped all games with 11+ runs into one bucket, which smooths any illogical jumps in the win function, but leaves the inconsistent marginal values unaddressed and fails to make any differentiation between scoring 20 runs and scoring 11).

2. Using the empirical distribution forces one to use integer values for runs scored per game. Obviously the number of runs a team scores in a game is restricted to integer values, but not allowing theoretical fractional runs makes it very difficult to apply any sort of park adjustment to the team frequency of runs scored. (Enby doesn’t allow for fractional runs either, which makes sense given that runs are indeed discrete, but you can park adjust Enby by park adjusting the baseline).

3. Related to #2 (really its root cause, although the park issue is important enough from the standpoint of using the results to evaluate teams that I wanted to single it out), when using the empirical data there is always a tradeoff that must be made between increasing the sample size and losing context. One could use multiple years of data to generate a smoother curve of marginal win probabilities, but in doing so one would lose centering at the season’s actual run scoring rate. On the other hand, one could split the data into AL and NL and more closely match context, but you would lose sample size and introduce quirks into the data.

Before leaving the topic of the Enby distribution, I was curious to see how it performed in estimating the major league run distribution for 2012. The major league average was 4.324 R/G, which corresponds to Enby distribution parameters of (r = 4.323, B = 1.0116, z = .0594). This graph truncates scoring at 15 runs per game to keep things manageable, and there’s very little probability in the far right tail:

From my (admittedly biased) vantage point, Enby does a fairly credible job of estimating the run scoring distribution. Enby is too low on zero and one run and too high on 2-4 runs, which is fairly common and thus an area for potential improvement to the model.
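
For readers who want to experiment, here is a minimal sketch of a zero-modified negative binomial in Python using the 2012 parameters quoted above. The parameterization (a negative binomial with mean r*B, with the probability of zero replaced by z and the rest rescaled) is the textbook form and is assumed rather than taken from the Enby implementation itself; it does reproduce the 4.324 R/G mean.

```python
import math


def negative_binomial_pmf(k, r, B):
    """P(X = k) for a negative binomial with mean r * B."""
    log_coef = math.lgamma(r + k) - math.lgamma(k + 1) - math.lgamma(r)
    return math.exp(log_coef + k * math.log(B / (1 + B)) - r * math.log(1 + B))


def zero_modified_nb_pmf(k, r, B, z):
    """Set P(0) = z and rescale the k >= 1 probabilities to sum to 1 - z."""
    if k == 0:
        return z
    p0 = negative_binomial_pmf(0, r, B)
    return (1 - z) * negative_binomial_pmf(k, r, B) / (1 - p0)


r, B, z = 4.323, 1.0116, 0.0594  # 2012 parameters quoted in the text
probs = [zero_modified_nb_pmf(k, r, B, z) for k in range(61)]
mean = sum(k * p for k, p in enumerate(probs))
print(round(mean, 2))  # 4.32
```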

I will not go into the full details of how gOW%, gDW%, and gEW% (which combines both into one measure of team quality) are calculated here, but full details were provided here. The “use” column here is the coefficient applied to each game to calculate gOW% while the “invuse” is the coefficient used for gDW%. For comparison, I have looked at OW%, DW%, and EW% (Pythagenpat record) for each team; none of these have been adjusted for park to maintain consistency with the g-family of measures which are not park-adjusted.
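
For reference, a Pythagenpat estimate can be sketched as below. The exponent constant of 0.29 is an assumption for this example (values in the neighborhood of .28 to .29 are commonly used), and the function name is made up.

```python
def pythagenpat_w_pct(runs, runs_allowed, games, z=0.29):
    """Estimated W% from runs scored and allowed.

    The Pythagorean exponent x floats with the run environment:
    x = (runs per game, both teams combined) ** z.
    """
    rpg = (runs + runs_allowed) / games
    x = rpg ** z
    return runs ** x / (runs ** x + runs_allowed ** x)


# Made-up team: 700 runs scored, 650 allowed over 162 games
ew = pythagenpat_w_pct(700, 650, 162)
```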

For most teams, gOW% and OW% are very similar. Teams whose gOW% is higher than OW% distributed their runs more efficiently (at least to the extent that the methodology captures reality); the reverse is true for teams with gOW% lower than OW%. The teams that had differences of +/- 2 wins between the two metrics were (all of these are the g-type less the regular estimate):

Positive: CIN, DET, KC, MIA
Negative: BOS, TEX, ARI, OAK

Last year, the Red Sox gOW% was 6.2 wins lower than their OW%, which is by far the highest I’ve seen since I started tracking this. Boston once again led the majors in this department, but only with a 2.5 win discrepancy. Of course, last year their gOW% was a still-excellent .572, while this year it was down to a near average .507.

As I’ve noted in an earlier post, Cincinnati’s offense was much worse than one would have expected given the names in the lineup and their recent performances. Historically bad leadoff hitters certainly didn’t help, but on the bright side, the Reds distributed their runs as efficiently as any team in MLB. CIN had a .479 OW% (which would be a little lower, .470, if I was park-adjusting), but their .498 gOW% was essentially league average. To see how this came about, the graph below considers Cincinnati’s runs scored distribution, the league average for 2012, and the Enby distribution expectation for a team averaging 4.15 runs per game (CIN actually averaged 4.13). The graph is cut off at 15 runs; the Reds’ highest single game total was 12:

The Reds were shut out much less frequently than an average team (or the expectation for a team with their average R/G), but they gave up much of this advantage by scoring exactly one run more frequently than expected. In total, CIN scored one or fewer runs 16.7% of the time, compared to a ML average of 17.4% and Enby expectation of 17.8%. They also scored precisely two runs less often than expected. Where Cincinnati made hay was in games of moderate runs scored--the Reds exceeded expectations for 3, 4, 5, and 6 runs scored. As you can see if you look at the chart from earlier in the post, the most valuable marginal runs in 2012 were 2-4, so the Reds did a decent job of clustering their runs in the sweet spot where an extra run can have a significant impact on your win expectancy.

From the defensive side, the biggest differences between gDW% and DW% were:

Positive: TEX, CHN, BAL
Negative: MIN, WAS, TB, NYA, CIN

The Reds and the Rangers managed to offset favorable/unfavorable offensive results with the opposite for defense. For the Twins to have the largest negative discrepancy was just cruel, considering that only COL (.386) and CLE (.411) had worse DW%s than Minnesota’s .418. In gDW%, Minnesota’s .400 was better only than Colorado’s .394, a gap that would be wiped out by any reasonable park adjustment.

gOW% and gDW% are combined via Pythagenpat math into gEW%, which can be compared to a team’s standard Pythagenpat record:

Positive: DET, CHN, KC, NYN, BAL, MIA
Negative: MIN, ARI, OAK, WAS, STL

The table below is sorted by gEW%:

Tuesday, January 01, 2013

Crude NFL Ratings, 2012

Since I have a ranking system for teams and am somewhat interested in the NFL, I don’t see any reason not to take a once a year detour into ranking NFL teams (even if I’d much rather have something useful to contribute regarding the second best pro sport, thoroughbred racing).

As a brief overview, the ratings are based on win ratio for the season, adjusted over the course of several iterations for opponent’s win ratio. They know nothing about injuries, about where games were played, about the distribution of points from game to game; nothing beyond the win ratio of all of the teams in the league and each team’s opponents. The final result is presented in a format that can be directly plugged into Log5. I call them “Crude Team Ratings” to avoid overselling them, but they tend to match the results of less modestly named systems fairly well.

First, I’ll offer ratings based on actual wins and losses, but I would caution against putting too much stock in them given the nature of the NFL. Ratios of win-loss records like 2-14 and 15-1 which pop up in the NFL are not easily handled by the system. In order to ensure that there are no divide by zero errors, I add half a win and half a loss to each team’s record. This is not an attempt at regression, which would require much more than one game of ballast. This year the most extreme records were 2-14 and 13-3, so the system produced fairly reasonable results:

In the table, aW% is an adjusted W% based on CTR. The rank order will be exactly the same, but I prefer the CTR form due to its Log5 compatibility. SOS is the average CTR of a team’s opponents, rk is the CTR rank of each team, and s rk is each team’s SOS rank.
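
The half-win, half-loss adjustment is simple enough to sketch in a few lines of Python. This covers only that one step, not the full iterative rating, and the function name `ballasted_win_ratio` is made up for the illustration:

```python
def ballasted_win_ratio(wins, losses, ballast=0.5):
    """Win ratio (W/L) after adding `ballast` wins and `ballast` losses.

    The name and signature are hypothetical; only the half-win,
    half-loss adjustment itself comes from the text.
    """
    return (wins + ballast) / (losses + ballast)


# A 16-0 team has an undefined raw win ratio (16/0);
# with the ballast it becomes 16.5/0.5.
print(ballasted_win_ratio(16, 0))

# This year's extremes, 2-14 and 13-3.
print(round(ballasted_win_ratio(2, 14), 3))
print(round(ballasted_win_ratio(13, 3), 3))
```

The ballast is there purely to keep the arithmetic defined; it does very little to pull an extreme record back toward .500.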

The rankings that I actually use are based on a Pythagorean estimated win ratio from points and points allowed:

Seattle’s #1 ranking was certainly a surprise, but last year Seattle’s 92 CTR ranked 13th in the league, reflecting a little better than their 7-9 record. When I have posted weekly updates on Twitter, I’ve gotten a few comments on the high ranking of the Bears. CTR may like Chicago more than some systems, but comparable systems with comparable inputs also hold them in high regard. Wayne Winston ranks them #5; Andy Dolphin #7; Jeff Sagarin #7; and Football-Reference #6. Chicago ranked sixth in the NFL in P/PA ratio, which is the primary determinant of CTR, and played an above-average schedule (they rank 10th in SOS at 116, which means that their average opponent was roughly as good as the Vikings). The NFC North was the second-strongest division in the league, with Green Bay ranking #6, Minnesota #9, and Detroit #17. The Bears played the AFC South, which didn’t help, although it was marginally better for SOS than playing the AFC West. Their interdivisional NFC foes were Arizona (#24), Carolina (#16), Dallas (#19), Seattle (#1), San Francisco (#3), and St. Louis (#13), which is a pretty strong slate.

Obviously the Bears did not close the season strong, but the system doesn’t know the sequence of games and weights everything equally. Still, their losses came to #1 Seattle, #3 San Francisco, twice to #6 Green Bay, #7 Houston, and #9 Minnesota. I didn’t check thoroughly, but I believe that no other team save Denver was undefeated against the bottom two-thirds of the league (the Broncos’ losses came to #2 New England, #7 Houston, and #8 Atlanta). Even the other top teams had worse losses--for instance, Seattle and New England both lost to #24 Arizona, San Francisco lost to #13 St. Louis, Green Bay and Houston lost to #23 Indianapolis, and Atlanta lost to #20 Tampa Bay.

Last year I figured the CTR for each division and conference as the arithmetic average of the CTRs of each member team, but that approach is flawed. Since the ratings are designed to be used multiplicatively, the geometric average provides a better means of averaging. However, given the properties of the geometric average, the arithmetic average of the geometric averages does not work out to the nice result of 100:

The NFC’s edge here is huge--it implies that the average NFC team should win 64% of the time against an average AFC team. The actual interconference record was 39-25 in favor of the NFC (.609). The NFC’s edge is naturally reflected in the team rankings; 7 of the top 10 teams are from the NFC with 7 of the bottom 8 and 10 of the bottom 12 from the AFC.
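
To make the averaging point concrete, here is a small Python sketch. The four ratings are invented for the illustration (they are not actual CTRs), and both function names are hypothetical. The second function is the usual Log5 calculation for ratings expressed in win-ratio form:

```python
import math


def geometric_mean(ratings):
    """Geometric average, the natural mean for multiplicative ratings."""
    return math.exp(sum(math.log(r) for r in ratings) / len(ratings))


def expected_w_pct(rating_a, rating_b):
    """Log5 for win-ratio-style ratings: A / (A + B)."""
    return rating_a / (rating_a + rating_b)


# Invented ratings for two two-team "divisions"; the arithmetic
# average of all four teams is exactly 100.
division_a = [150, 130]
division_b = [70, 50]

gm_a = geometric_mean(division_a)
gm_b = geometric_mean(division_b)

# The arithmetic average of the geometric averages is not 100.
print(round(gm_a, 2), round(gm_b, 2), round((gm_a + gm_b) / 2, 2))

# Implied W% for an average team from A against an average team from B.
print(round(expected_w_pct(gm_a, gm_b), 3))
```

The same `expected_w_pct` step, applied to the two conference averages, is what turns a gap in ratings into an implied interconference W%.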

This exercise wouldn’t be a lot of fun if I didn’t use it to estimate playoff probabilities. First, though, we need regressed CTRs. This year, I’ve added 12.2 games of .500 to each team’s raw win ratio based on the approach outlined here. That produces this set of ratings, which naturally result in a compression of the range between the top and bottom of the league, and a few teams shuffling positions:

The rank orders differ not because the regression changes the order of the estimated win ratios fed into the system (it doesn’t), but because the magnitude of the strength of schedule adjustment is reduced.
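
For readers who want to see the compression, the regression step might look something like the Python sketch below. It assumes the 12.2 games of ballast are split evenly into wins and losses and added to a 16-game record implied by the estimated W%; that reading, along with the function name and parameters, is an assumption made for the illustration rather than a statement of the exact calculation:

```python
def regressed_win_ratio(w_pct, games=16, ballast_games=12.2):
    """Win ratio after adding `ballast_games` of .500 ball.

    Assumption: the ballast is added as equal wins and losses to the
    record implied by `w_pct` over `games`. Hypothetical helper.
    """
    wins = w_pct * games + ballast_games / 2
    losses = (1 - w_pct) * games + ballast_games / 2
    return wins / losses


# A team with a .750 estimated W% has a raw win ratio of 3.0;
# after regression the ratio is pulled well back toward 1.
print(round(regressed_win_ratio(0.750), 3))

# A .500 team is unaffected.
print(regressed_win_ratio(0.500))
```

Because the same ballast goes to every team, the order of the win ratios going in is preserved; only the spread shrinks.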

Last year I included tables listing probabilities for each round of the playoffs, but I will limit my presentation here to the first round and the probabilities of advancement. After each round of the playoffs, the CTRs should be updated to reflect the additional data on each team, and thus the extensive tables will be obsolete (although I will share a few nuggets). This updating might not be particularly important for MLB, since a five or seven game series adds little information when we already have a 162 game sample on which to evaluate a team. But for the more limited sample available for the NFL, each new data point helps.

In figuring playoff odds, I assume that having home field advantage increases a team’s CTR by 32.6% (this is equivalent to assuming that the average home W% is .570). Here is what the system thinks about the wildcard round:

The home team is a solid favorite in each game except for Washington, which faces the top-ranked team in the league. Houston is the weakest favorite; the Texans would be estimated to have a 54% chance on a neutral field and 47% at Cincinnati.
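
The home field arithmetic can be checked with a short Python sketch. The ratings of 54 and 46 are not actual CTRs; they are placeholder values chosen only because they give a 54% neutral-field probability, and the function and constant names are hypothetical:

```python
HOME_FIELD_MULTIPLIER = 1.326  # a 32.6% boost to the home team's rating


def win_probability(rating_a, rating_b, a_is_home=None):
    """Log5 probability that team A beats team B.

    a_is_home: True if A hosts, False if B hosts, None for a neutral field.
    """
    if a_is_home is True:
        rating_a = rating_a * HOME_FIELD_MULTIPLIER
    elif a_is_home is False:
        rating_b = rating_b * HOME_FIELD_MULTIPLIER
    return rating_a / (rating_a + rating_b)


# Two equal teams: the home side should win about 57% of the time.
print(round(win_probability(100, 100, a_is_home=True), 3))

# Placeholder ratings giving a 54% neutral-field favorite.
print(round(win_probability(54, 46), 2))
print(round(win_probability(54, 46, a_is_home=False), 2))
```

A team that is a 54% favorite on a neutral field drops to roughly 47% as the visitor, in line with the Houston figures above.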

The overall estimated probabilities for teams to advance to each round are as follows:

San Francisco, Denver, and New England are all virtually even at 20% to win the Super Bowl. The Patriots are the highest ranked of the three, but San Francisco benefits from the weak NFC and Denver from home field advantage. CTR would naturally pick Seattle to win it all if they weren’t at a seeding disadvantage; however, their probability of winning the Super Bowl given surviving the first round is 14%, greater than Atlanta’s 12%.

The most likely AFC title game is Denver/New England (48% chance), with Denver given a 54% chance to win (it would be 47% on a neutral field and 40% at New England); the least likely AFC title game is Indianapolis/Cincinnati (1% chance). The most likely NFC title game is Atlanta/San Francisco (34%), with a 53% chance of a 49ers road win; the least likely matchup is Washington/Minnesota (2%). The most likely Super Bowl matchup is Denver/San Francisco (14% likelihood and 54% chance of a 49er win); the least likely is Indianapolis/Washington (.1%). The NFC is estimated to have a 51% chance of winning the Super Bowl, lower than one might expect given the NFC’s dominance in the overall rankings. However, the NFC’s best team has to win three games on the road (barring a title game against Minnesota) while the probability of New England or Denver carrying the banner for the AFC is estimated to be 77%.

Of course, all of these probabilities are just estimates based on a fairly crude rating system, and last year the Giants were considered quite unlikely to win the Super Bowl (although I didn’t regress enough in calculating the playoff probabilities last year, resulting in overstating the degree of that unlikelihood).