Saturday, January 28, 2012
Crude Team Ratings, 2011
Anyone can throw together a spreadsheet and declare that they have a ranking system for teams. It’s not particularly hard to construct a reasonable method by which to take an initial estimate of team strength, adjust for strength of schedule, recalculate each team’s rating, adjust for SOS again, rinse, repeat. I have done just that, and will present the 2011 ratings here.
If you want the full details, please refer to the linked post. The gist of the system is:
1) Start with a win ratio figure for each team. It could be actual win ratio, or an estimated win ratio.
2) Figure the average win ratio of the team’s opponents.
3) Adjust for strength of schedule, resulting in a new set of ratings.
4) Begin the process again. Repeat until the ratings stabilize.
The resulting figure is in the form of an adjusted win ratio; I force the average team to a rating of 100. The ratings can be plugged directly into an odds ratio--a team with a rating of 120 should win about 60% of the time against a team with a rating of 80 (120/(120 + 80)).
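To make the loop concrete, here is a minimal sketch, assuming the strength-of-schedule adjustment is simply "multiply the raw win ratio by the average rating of the opponents faced" (the linked post has the exact details, which may differ); the win_ratio and schedule inputs are hypothetical:

```python
def crude_team_ratings(win_ratio, schedule, iterations=1000, tol=1e-6):
    """win_ratio: {team: actual or estimated win ratio};
    schedule: {team: list of opponents faced, one entry per game}."""
    # step 1: start with a win ratio figure for each team
    rating = dict(win_ratio)
    for _ in range(iterations):
        new_rating = {}
        for team, opponents in schedule.items():
            # step 2: average rating of the team's opponents (repeats act as game weights)
            sos = sum(rating[opp] for opp in opponents) / len(opponents)
            # step 3: adjust for strength of schedule (assumed here to be a simple product)
            new_rating[team] = win_ratio[team] * sos
        # force the average team to a rating of 100
        scale = 100.0 * len(new_rating) / sum(new_rating.values())
        new_rating = {team: r * scale for team, r in new_rating.items()}
        # step 4: repeat until the ratings stabilize
        converged = max(abs(new_rating[t] - rating[t]) for t in rating) < tol
        rating = new_rating
        if converged:
            break
    return rating


def win_probability(rating_a, rating_b):
    # odds-ratio use of the ratings: a 120 team beats an 80 team about 60% of the time
    return rating_a / (rating_a + rating_b)
```

With ratings in hand, win_probability(120, 80) returns .600, matching the example above.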
I’ll present four different sets of ratings here, each using a different win ratio as the input. It’s overkill to run this many, but if for some reason you prefer a certain estimate of win ratio, it may be represented.
Since 2011 is in the past, there’s no particular value in predictive ratings, so I’ll focus on the CTR based on actual wins and losses:
aW% is the adjusted W% based on CTR; SOS is the weighted average CTR of the team’s opponents; rk is the team’s ranking among the thirty teams; and s rk is the SOS rank.
The results aren’t particularly surprising; the teams are ranked pretty close to how they would be in W%. In some recent years, the results would favor AL teams much more than just looking at pure W%, but the National League held its own with the AL in 2011 as seen from the league/division ratings (simply the average rating for each member team):
That makes for a nice rank order of divisions, with East > West > Central, and AL > NL in each case. Still, the overall AL/NL rating difference of 103/97 is a lot smaller than in previous seasons, including 108/93 in 2010. While the NL Central remained the weakest division, 89 was an improvement over the 82 rating in 2010. If Houston were in the AL rather than the NL (and assuming all the ratings stayed constant), the leagues would each have had a CTR of 100.
The next set of CTRs is based on Game Expected W% as described in this post. Basically, gEW% assumes independence between runs scored and runs allowed in a given game, and uses the 2011 empirical W% for teams scoring or allowing X runs in conjunction with each team’s actual game-by-game distribution of runs scored and runs allowed to estimate their W%. The resulting CTRs:
Using classic Pythagenpat as the input:
Finally, using Pythagenpat estimated win ratios based on runs created and runs created allowed:
Obviously, any number of combinations of win ratio estimates could be used as inputs, regression could be mixed in, and so on. What I’ve presented here are just the most straightforward ratings based on obvious single inputs.
Tuesday, January 17, 2012
Run Distribution and W%, 2011
A couple of caveats apply to everything that follows in this post. The first is that there are no park adjustments anywhere. There's obviously a difference between scoring 5 runs at Petco and scoring 5 runs at Coors, but if you're using discrete data there's not much that can be done about it unless you want to use a different distribution for every possible context. Similarly, it's necessary to acknowledge that games do not always consist of nine innings; again, it's tough to do anything about this while maintaining your sanity.
All of the conversions of runs to wins are based only on 2011 data. Ideally, I would use an appropriate distribution for runs per game based on average R/G, but I've taken the lazy way out and used the empirical data for 2011 only.
This post also contains little in the way of "analysis" and a lot of tables. This is probably a good thing for you as the reader, but I felt obliged to warn you anyway. I’ve cut out a lot of what I listed last year simply because I don’t have that much free time right now. The data was not particularly useful in any event—knowing how many runs teams scored and allowed in their wins and losses, or what percentage of their games fell into arbitrarily defined classes might offer some trivia but is not exactly essential material.
The first breakout is record in blowouts versus non-blowouts. I define a blowout as a margin of five or more runs. This is not really a satisfactory definition of a blowout, as many five-run games are quite competitive--"blowout" is just a convenient label that expresses the point succinctly. I use these two categories with wide ranges rather than narrower groupings like one-run games because the frequency and results of one-run games are highly biased by home field advantage. Drawing the focus back a little allows us to identify close games and not-so-close games, with enough margin built in to give a better chance of capturing the true nature of the game in question rather than a disguised situational effect.
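As an illustration only, here is a small sketch of the breakout itself, assuming a hypothetical list of (runs scored, runs allowed) pairs for a team's games:

```python
def blowout_breakout(games, margin=5):
    """games: hypothetical list of (runs_scored, runs_allowed) pairs.
    Returns ((blowout W, blowout L), (non-blowout W, non-blowout L))."""
    blowout, close = [0, 0], [0, 0]
    for rs, ra in games:
        bucket = blowout if abs(rs - ra) >= margin else close
        bucket[0 if rs > ra else 1] += 1  # no ties in MLB, so rs != ra
    return tuple(blowout), tuple(close)
```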
In 2011, 75.8% of games were non-blowouts and 24.2% were blowouts. The teams sorted by non-blowout record:
The standard deviation of W% in non-blowouts was .064, which as expected is less than the standard deviation for blowouts (.114) and all games (.070).
Records in blowouts:
Obviously the sample size on these games is pretty small, but Kansas City and Oakland at .500 in blowouts caught my eye.
This chart shows blowout W% less non-blowout W%, along with the percentage of games that were blowouts and non-blowouts for each team:
This is the second year in a row in which San Diego has ranked high in terms of difference between blowout and non-blowout record. Usually teams with large differences are the better teams; that description may have fit the Padres in 2010 but not in 2011. Cleveland was the most extreme team in either direction in the majors. Florida played in the smallest proportion of blowouts while Texas played in the most.
A more interesting way to consider game-level results is to look at how teams perform when scoring or allowing a given number of runs. For the majors as a whole, here are the counts of games in which teams scored X runs:
The “marg” column shows the marginal W% for each additional run scored. The second and third run were both worth about .15 wins on average in 2011, while scoring four runs was the cutoff point between winning and losing (on average, of course).
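For reference, a minimal sketch of how those two columns can be derived from the raw counts; the games and wins dictionaries are hypothetical inputs keyed by runs scored:

```python
def w_pct_by_runs_scored(games, wins):
    """games[x] = team-games with exactly x runs scored; wins[x] = how many were won."""
    return {x: wins.get(x, 0) / games[x] for x in games if games[x] > 0}


def marginal_run_values(w_pct):
    # marginal value of the xth run = W% when scoring x, minus W% when scoring x - 1
    return {x: w_pct[x] - w_pct[x - 1] for x in sorted(w_pct) if x - 1 in w_pct}
```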
I use these figures to calculate a measure I call game Offensive W% (or Defensive W% as the case may be), which was suggested by Bill James in an old Abstract. It is a crude way to use each team’s actual runs per game distribution to estimate what their W% should have been by using the overall empirical W% by runs scored for the majors in the particular season.
Using the empirical distribution rather than a theoretical distribution has the upside of being simple (modeling the runs per game distribution is fairly messy), but the benefits are outnumbered by the drawbacks. A non-comprehensive list of said drawbacks:
1. The empirical distribution is subject to sample size fluctuations. In 2011, at least, each additional run increased W%; this is often not the case given the low frequency of high scoring games. Even so, the marginal values don’t necessarily make sense--for instance, the marginal value of a tenth run is implied to be .006 wins while the marginal value of an eleventh run is implied to be .040.
2. Using the empirical distribution forces one to use integer values for runs scored per game. Obviously the number of runs a team scores in a game is restricted to integer values, but not allowing theoretical fractional runs makes it very difficult to apply any sort of park adjustment to the team frequency of runs scored.
3. Related to #2 (really its root cause, although the park issue is important enough from the standpoint of using the results to evaluate teams that I wanted to single it out), when using the empirical data there is always a tradeoff that must be made between increasing the sample size and losing context. One could use multiple years of data to generate a smoother curve of marginal win probabilities, but in doing so one would lose centering at the season’s actual run scoring rate. On the other hand, one could split the data into AL and NL to more closely match context, but then one would lose sample size and introduce quirks into the data.
I will not go into the full details of how gOW%, gDW%, and gEW% (which combines both into one measure of team quality) are calculated here; they were disclosed in this post. The “use” column here is the coefficient applied to each game to calculate gOW%, while the “invuse” is the coefficient used for gDW%. For comparison, I have looked at OW%, DW%, and EW% (Pythagenpat record) for each team; none of these have been adjusted for park, to maintain consistency with the g-family of measures, which are not park-adjusted.
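As a rough sketch only--the exact “use” and “invuse” coefficients come from the linked post and may include further adjustments--the g-family can be thought of along these lines, with an odds-ratio step standing in for the Pythagorean combination:

```python
def g_ow(runs_scored_by_game, league_w_by_runs):
    # offensive credit per game: league W% when scoring that many runs ("use")
    coeffs = [league_w_by_runs[r] for r in runs_scored_by_game]
    return sum(coeffs) / len(coeffs)


def g_dw(runs_allowed_by_game, league_w_by_runs):
    # defensive credit per game: one minus the league W% when scoring that many runs ("invuse")
    coeffs = [1 - league_w_by_runs[r] for r in runs_allowed_by_game]
    return sum(coeffs) / len(coeffs)


def g_ew(gow, gdw):
    # one reasonable combination (odds ratio); the author's Pythagorean math may differ
    return gow * gdw / (gow * gdw + (1 - gow) * (1 - gdw))
```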
For most teams, gOW% and OW% are very similar. Teams whose gOW% is higher than OW% distributed their runs more efficiently (at least to the extent that the methodology captures reality); the reverse is true for teams with gOW% lower than OW%. The teams that had differences of +/- 2 wins between the two metrics were (all of these are the g-type less the regular estimate):
Positive: BAL, PIT, ATL, FLA, HOU, SEA
Negative: BOS, NYA, TEX, COL
You'll note that the positive differences tended to belong to bad offenses; this is a natural result of the nature of the game, and is reflected in the marginal value of each run as discussed above. In the four years that I’ve been looking at these figures, I can’t recall a difference as large as the Red Sox’ deviation in 2011--a standard OW% of .610 and a gOW% of .572, a 6.2 win difference. Boston led the majors in OW%; their gOW% was still excellent and good enough for third in the majors, but they did not spread their runs across games in an efficient fashion. The Sox scored ten or more runs 25 times; Toronto was second with 19 and the major league average was 9. Boston scored 36% of their runs in that 15% subset of games; the major league average was 15%, and next on the list was Texas at 28%.
Differences for gDW%:
Positive: DET, BAL
Negative: PHI, SD, TB
I combine gOW% and gDW% through some Pythagorean math to produce gEW%, which can then be compared to a team’s standard Pythagorean record (EW%). Of course, it could also be compared to actual W%, but I think the comparison to a method that also uses runs is more interesting than a comparison to the actual win totals:
Positive: BAL, PIT, CHA, DET, MIN, HOU, OAK, FLA
Negative: BOS, PHI, COL, NYA, SD, TB, LA, KC
There are so many large differences that I’m a little worried that I may have made a spreadsheet error somewhere along the way, although I have double-checked and can’t find anything. Below is a table with all of the metrics discussed in this post for each team, sorted by gEW%:
Wednesday, January 04, 2012
Crude NFL Ratings, 2011
The NFL is a distant third on my list of pro sports interests (baseball is #1, of course, and horse racing ranks #2), but I’m interested enough to run the teams through my crude rating system (see explanation here) and figure I might post the ratings here. They are based on points/points allowed, adjusted for strength of schedule. 100 represents a win/loss ratio of 1, and so the resulting ratings are adjusted win ratios and can very easily be used to estimate the probability of a team winning a particular game. A team with a rating of 100 should beat a team with a rating of 50 about 2/3 of the time (100/(100 + 50)).
Let me first run a list based on actual wins and losses. I’ve calculated W/L ratio as (W + .5)/(L + .5) here just to avoid the (real in the NFL) possibility of a 16-0 team crashing the system:
In the chart, aW% is an adjusted W%; it averages to .500 for the NFL and will produce the same list in rank order as the CTR; I prefer the latter because of its Log5 readiness, but aW% is a more meaningful unit. SOS is the weighted average of opponent’s strength of schedule. “rk” is the team’s rank in CTR, while “s rk” is the team’s rank in the SOS estimate.
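The (W + .5)/(L + .5) tweak amounts to no more than this (a trivial sketch, but it shows why a perfect record no longer breaks the math):

```python
def regularized_win_ratio(wins, losses):
    # (W + .5)/(L + .5): a 16-0 team becomes 16.5/0.5 = 33 rather than infinity
    return (wins + 0.5) / (losses + 0.5)
```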
I really do not care for the actual W% presentation for the NFL due to the short season magnifying differences between the teams. The Packers tower over the league here, which is appropriate given a 15-1 record against a decent schedule, but it doesn’t have any predictive value. You will notice in the table above that the NFC does quite well, which will carry through to the points-based ratings:
Green Bay does not even rank #1 in the league; both New Orleans and San Francisco rank ahead of them. The top nine teams all made the playoffs, as did eleven of the top fourteen, which is pretty good I think.
The aggregate ratings for the divisions (simply the average rating of the four member teams) illustrate the superiority of the NFC and why I don’t care for micro-divisions:
Last year, the NFC West turned in a ghastly 29 rating. Led by San Francisco, this year they were far from the worst division in the league, a distinction that went to their AFC brethren.
This whole exercise would be devoid of a great deal of entertainment value if I did not use the results to estimate Super Bowl probabilities. The disclaimer list here is lengthy enough that I will skip it lest I leave anything out. A credibility adjustment would be pretty simple to implement (adding 12 games of a 100 rating would do the trick), but this is just NFL stats, not something important. The playoff odds do consider home field advantage; the home team’s rating is multiplied by 57/43 to reflect a fairly average NFL home field advantage. I feel bad about listing the probabilities to the thousandth place, but there are so many possible combinations for the championship games and Super Bowl that those tables would look silly without it:
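Here is a minimal sketch of the single-game probability those odds are built from, using the ratings in the odds-ratio fashion described earlier with the 57/43 home field multiplier; chaining these probabilities over the possible matchups yields the round-by-round tables:

```python
def playoff_game_prob(home_rating, away_rating, hfa=57.0 / 43.0):
    # the home team's rating is multiplied by 57/43 before the odds-ratio step
    adj_home = home_rating * hfa
    return adj_home / (adj_home + away_rating)

# two evenly rated teams: playoff_game_prob(100, 100) is about .570 for the home side
```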
Two road favorites on the first weekend is probably pretty typical given the quality of teams that often win micro-divisions (particularly those like the AFC West). The Denver Broncos simply aren’t a very good football team (it is tough for me to leave it at that, but piling on more snark re: you-know-who is beyond excessive at this point).
I like reseeding in theory, but when your initial seeding insists that Denver ranks #4 in the AFC because they are the sharpest scissors in the kindergarten classroom, it loses some of its luster.
Life is tough enough as a Browns fan without having to worry about horrors like a Denver/Cincinnati AFC title game, but thankfully there’s a 99.8% chance that will not come to pass. Pittsburgh/Baltimore, on the other hand, is the most likely championship game scenario that doesn’t involve either conference’s #1 seed.
Combining all of these, here are the playoff probabilities for each team:
The system still considers Green Bay the Super Bowl favorites even though they rank below New Orleans and San Francisco, thanks to favorable second round matchups and home field advantage, which is much more significant in the NFL playoffs than in MLB. Ratings and home field aside, if the NFC title game turns out to be Packers/Saints, I’m picking the latter to win it all. These probabilities add up to a 57% chance of the NFC representative winning the Super Bowl.