Walk Like a Sabermetrician: Putting Pitcher W-L Records in Context

It seems as if this is a post that I write in one form or another every year. As such it may seem as if it is a topic that captivates me, but it really doesn't: I don't put the resulting figures to much if any use in my own comparisons of pitchers. Still, there are some points that I think are simple, yet oft-ignored, and as such they bear occasional repeating.

Mainstream fans and writers remain invested in the notion that win-loss records for individual pitchers are important; they certainly don't value them as much as their fathers did twenty years ago, but W-L records have yet to be completely discarded as a comparative tool and likely will not be for some time. Accepting that this is the case, how can one go about using them in the most effective way?

It should be obvious to everyone, not just sabermetricians, that W-L records include a lot of noise from factors other than the pitcher's performance, most importantly the performance of their team's offense when they happen to pitch. Performance by relievers and fielders certainly plays a role as well, but data on those aspects of the game is scarcer, particularly as one moves back in time.

Historical run support data is available, but it is often ignored in favor of just looking at the team's overall winning percentage (usually the team's W% when the given pitcher does not get the decision). The pros of this approach are:

1) it is simple
2) it does (but only to some limited extent, largely drowned out by other noise) capture the bullpen and fielder effects that are ignored by run support alone
3) it allows one to dispense with league and park adjustments since W% is always anchored at .500

The cons are (beyond the issues that W-L record already brings to the table):

1) it does not isolate performance when the pitcher actually pitches; some will receive lousy run support despite pitching for good offensive teams
2) it allows the performance of the team's other pitchers to greatly affect the comparison; the classic example is that it was difficult for a Steve Avery to exceed the performance of Greg Maddux, Tom Glavine, and John Smoltz. In fact, comparing a pitcher's W% to that of his teammates implicitly assumes that all of the deviation from .500 observed in teammate W% is attributable to factors that benefited the pitcher in question to the same degree.

On the career level, some of these factors will wash out to some extent. A pitcher with a long career is more likely than not to pitch for teams whose deviations from .500 are somewhat balanced between being caused by runs scored and runs allowed. Over the course of a pitcher's career, it's likely that his run support will be about the same as his team's average runs scored.

So using teammate W% (which I'll call Mate, for kicks if nothing else) should be a reasonable approximation for a more in-depth examination of run support (and bullpen/fielding support), and should give us a better read on pitcher value than just looking at W-L without any adjustments whatsoever.

At this point, the natural inclination of most people is to simply subtract Mate from W%. This is what Ted Oliver did in his Weighted Rating System, and it's what Neft and Cohen did in early editions of the late, great Sports Encyclopedia: Baseball. However, given the fact that the average team is going to deviate from .500 in equal parts due to its offense and defense (defense being defined as pitching + fielding), doesn't it make sense to remove the defensive part of the deviation?

Of course, some of the team's defensive performance does move the baseline expectation for an individual pitcher from .500. However, since any metric based on W-L is going to be somewhat crude by its nature, let's just assume that the half of the team deviation from .500 that is attributable to its defense should be removed altogether for the purpose of setting the baseline W% for the pitcher in question. Let's also keep it simple and assume that team W% is a linear function of runs and runs allowed. Then we can say that:

Expected W% for pitcher = (Mate - .5)/2 + .5

What this does is take half of the team's deviation from .500 (the half that we are crediting to the offense) to estimate the W% of an average pitcher placed on this team. We can simplify that equation to Mate/2 + .25. Thus, an average pitcher on a .500 team is expected to go .5/2 + .25 = .500, of course. On a .600 team, he should have a .550 W%.
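As a quick check on the arithmetic, the baseline can be sketched in a few lines of Python. The function name expected_wpct is just an illustrative label, and the Mate values in the loop are arbitrary examples rather than any real team's figures.

```python
def expected_wpct(mate):
    """Expected W% for an average pitcher on a team with the given Mate.

    Credits half of the team's deviation from .500 to the offense,
    which is the simplifying assumption made in the text.
    """
    return (mate - 0.5) / 2 + 0.5


# Arbitrary example values of Mate, not real team records.
for mate in (0.400, 0.500, 0.600):
    print(f"Mate {mate:.3f} -> expected W% {expected_wpct(mate):.3f}")
```

The simplified form Mate/2 + .25 gives the same result for any input, so either version can be used.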

This approach will actually lessen the strength of the adjustment for pitchers on non-.500 teams; if you simply compare W% to Mate with no adjustment, you will give a greater boost to pitchers on losing teams and take a larger bite out of the records of pitchers on winning teams. However, I believe this adjustment (originally proposed by Rob Wood in By the Numbers) is more useful for the reasons discussed above.

Since fielding and relievers do play a part in determining the W% of an individual pitcher, we could make the amount of regression something less than 50% to attempt to capture that--we could make it 40% towards .500, or 45%, or whatever value you'd like to make a case for. 50% is simple and easy to explain, though, and chasing precision in a metric built on W-L record is a fool's errand.
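If you did want to experiment with a different amount of regression, a minimal sketch might look like the following. The function name and the regression parameter are hypothetical names chosen for illustration; a value of 0.5 reproduces the Mate/2 + .25 baseline, and 0.4 corresponds to the 40% case mentioned above.

```python
def expected_wpct_regressed(mate, regression=0.5):
    """Baseline W% after pulling Mate toward .500 by the given share.

    regression is the fraction of the team's deviation from .500 that is
    removed (assumed here to be the defensive share): 0.5 matches the
    simple approach in the text, and 0.0 leaves Mate untouched.
    """
    if not 0.0 <= regression <= 1.0:
        raise ValueError("regression must be between 0 and 1")
    return 0.5 + (1.0 - regression) * (mate - 0.5)


# A .600 Mate under a few candidate regression amounts.
for r in (0.40, 0.45, 0.50):
    print(f"regression {r:.2f} -> baseline {expected_wpct_regressed(0.600, r):.3f}")
```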

With this in place, we can figure Neutral W% (the W% we expect for the pitcher given that he was on a .500 team) as:

NW% = W% - Mate/2 - .25 + .5 = W% - Mate/2 + .25

A .600 pitcher on a .600 team will be given a NW% of .550, rather than .500 under the Oliver approach.
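The difference between the two approaches is easy to see side by side. This is a rough sketch with function names of my own choosing; the straight-subtraction version adds .5 back so that both results sit on the same scale, which is a framing assumption made for the comparison rather than a formula quoted from Oliver.

```python
def neutral_wpct(wpct, mate):
    """NW% using the half-deviation baseline: W% - Mate/2 + .25."""
    return wpct - mate / 2 + 0.25


def subtraction_wpct(wpct, mate):
    """W% minus Mate, re-centered on .500 (straight-subtraction approach)."""
    return wpct - mate + 0.5


# The .600 pitcher on a .600 team from the text.
print(f"half-deviation: {neutral_wpct(0.600, 0.600):.3f}")
print(f"subtraction:    {subtraction_wpct(0.600, 0.600):.3f}")
```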

With NW% in place, it is a snap to figure wins above a baseline. Wins Above Team, used by Thorn and Palmer to denote wins above a .500 pitcher placed on the team, is:

WAT = (NW% - .5)*(W + L)

One could use IP/9 or some other method for neutralizing decisions, but I have decided to assume that the pitcher would receive the same number of decisions regardless of the quality of team he pitched for. This may not be true or completely fair, but I think it's close enough and it preserves simplicity.

Of course, any baseline can be substituted for .5 (average); I personally use .390 for replacement level, so it is very easy to figure Wins Compared to Replacement as:

WCR = (NW% - .39)*(W + L)
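Since WAT and WCR differ only in the baseline, one small function can cover both. This is a sketch under the same assumption as above (the pitcher keeps his actual number of decisions); the function name and the sample pitcher are hypothetical.

```python
def wins_above_baseline(nw_pct, wins, losses, baseline=0.5):
    """Wins above a pitcher with the given baseline W% over the same decisions."""
    return (nw_pct - baseline) * (wins + losses)


# Hypothetical pitcher: 200-150 with a .560 NW% (not a real career line).
nw_pct, wins, losses = 0.560, 200, 150
wat = wins_above_baseline(nw_pct, wins, losses)                 # vs. .500
wcr = wins_above_baseline(nw_pct, wins, losses, baseline=0.39)  # vs. replacement
print(f"WAT {wat:.1f}, WCR {wcr:.1f}")
```

Swapping in a different replacement level only requires changing the baseline argument.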

Now let's actually put this to use and talk about everybody's favorite pitchers, Bert Blyleven and Jack Morris. If we consider their records without any sort of adjustment for Mate, we have:

Morris' W% is higher, and he looks a lot better when compared to a .500 baseline. If you compare to a replacement baseline, it's pretty even--Blyleven's extra 97 decisions enable him to make up the gap in percentage.

When we consider the performance of their teammates, things will get a little bit closer. Morris' teams had a .538 W% when he did not get a decision (Mate); Blyleven's .495. That results in these neutral records:

Morris still has a higher NW%, but Blyleven is closer when compared to .500 and has the lead in WCR. Of course one can argue about the proper baseline for Hall of Fame comparisons, but I think it's fair to say that, when viewed in this light, neither pitcher clearly distinguishes himself from the other on the basis of W-L record.

I have posted a spreadsheet with career records for predominantly post-1900 pitchers who did not pitch in 2009 (with the exception of Randy Johnson, who announced his retirement). I believe that the list includes every 150 game winner during that period, in addition to a number of other pitchers who won 100 or more games. A technical note: Mate, NW%, WAT, and WCR are figured on a year-by-year basis, weighted by the pitcher's decisions in that season.
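For anyone who wants to replicate the year-by-year calculation, here is one way it might be sketched. It reflects my reading of the technical note: career Mate and NW% are decision-weighted averages of the seasonal values, and WAT and WCR are sums of the seasonal figures. The function name, the tuple layout, and the two sample seasons are all hypothetical, not taken from the spreadsheet.

```python
def career_line(seasons):
    """Aggregate seasonal (wins, losses, mate) tuples into career figures."""
    total_dec = 0
    mate_sum = 0.0
    nw_sum = 0.0
    wat = 0.0
    wcr = 0.0
    for wins, losses, mate in seasons:
        dec = wins + losses
        if dec == 0:
            continue  # a season with no decisions carries no weight
        nw = wins / dec - mate / 2 + 0.25  # seasonal NW%
        total_dec += dec
        mate_sum += mate * dec             # weight Mate by decisions
        nw_sum += nw * dec                 # weight NW% by decisions
        wat += (nw - 0.5) * dec
        wcr += (nw - 0.39) * dec
    if total_dec == 0:
        raise ValueError("no decisions in any season")
    return {
        "Mate": mate_sum / total_dec,
        "NW%": nw_sum / total_dec,
        "WAT": wat,
        "WCR": wcr,
    }


# Two made-up seasons: 20-10 with a .600 Mate, then 10-10 with a .400 Mate.
print(career_line([(20, 10, 0.600), (10, 10, 0.400)]))
```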

The 150 win pitchers with the highest Mate:

I'm sure someone will leave an anonymous comment complaining that I called him Miner Brown, like I did when I referred to "Hans" Wagner. I could have used Mordecai; Three Finger would force the first name column to be wider. Digression aside, all of these guys pitched significant portions of their careers for dynastic teams. Gomez had the best teammates by far, and is also a poster child for why the Wood approach is superior to the Oliver approach. Gomez beat his teammates' W% by just thirteen percentage points. Figured traditionally, he is only +4 WAT.

The 150 win pitchers with the lowest Mate:

As you know, Rick Reuschel is a pitcher who has been very much overlooked; Walter Johnson won over 400 games despite pitching for .460 teams without him; and Jack Powell is a good litmus test for just how strong your preference for value compared to replacement is.

Once again, here is the link to the career spreadsheet.