Wednesday, December 28, 2005

Rate Stat Series, pt. 4

Perhaps the title for this series is a misnomer, because we will begin to leave the exclusive realm of rate stats to talk about value stats that are generated from them. “Offensive Evaluation” series will probably suit it better, and I think I’ll use that from now on. If you are constructing an ability stat, it is probably going to be a rate stat. True, there is an ability to stay in the lineup that you may want to account for, but outside of this, ability is without question a rate. The rate ideally will be expressed with some sort of margin of error or confidence interval, because it is just an estimate of the player’s true ability. But the fundamental thing you need is a rate.

A backward-looking value stat needs to have playing time as a component. In this segment, we will look at a method of estimating value based on a rate stat (R/O) which we have already discussed, and see whether it stands up to the same kind of logical tests we have used previously.

The first approach is the one developed by Bill James and used in his work for many years (although it has now been discarded). Bill calculated Runs Created, but knew of course that this was just a quantity, and that in order to express value he needed to consider the quality of the performance and the “baseball time” (read outs, or PAs, or innings, etc.) over which it was accumulated. Bill, as have most analysts over the years, analyzed the issues we have covered over the last couple of installments and chose R/O as his rate stat. However, this scale was unfamiliar to the average fan. He instead chose to express it as a number of runs/game for a team, since runs/game has a scale which more fans and even sabermetricians are familiar with. So:
R/G = (O/G)*(R/O)

Where O/G is some pre-defined value. Bill used 25 O/G (and later 25.5) when considering just batting outs (AB - H), 25.5 (and later 26) when considering AB - H + CS, and 27 when considering all outs in the official statistics (AB - H + CS + SH + SF). You could use the league average, or the team average, or what have you--25.2 is actually a more precise estimate when using batting outs, 25.5 when including CS, and 27 when including all outs. I will sometimes just abbreviate R/G as RG.

Bill could have expressed this in terms of runs/team season by multiplying by some number of games (probably 162), or in terms of runs/inning, etc. He chose runs/game.

Then he decided to express the R/G figure in terms of a Winning Percentage through the Pythagorean theorem. He defined a player’s Offensive Winning Percentage (OW%) as the winning percentage that a team would have if each player on the team hit as the player hit, and the team allowed a league average number of runs.
OW% = (R/G)^x/((R/G)^x + Lg(R/G)^x)
Where x is the exponent used, generally 2. I will often refer to the league average of R/G simply as “N”.
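The two formulas so far can be sketched in a few lines of Python. This is only an illustration, not anything from James’s own work: the function names are mine, the default of 25 outs per game assumes batting outs only, and the sample hitter (100 runs created in 400 outs, N = 4.5) is made up.

```python
def runs_per_game(runs_created, outs, outs_per_game=25.0):
    """R/G = (O/G) * (R/O)."""
    return outs_per_game * runs_created / outs


def offensive_win_pct(rg, league_rg, exponent=2.0):
    """OW% = RG^x / (RG^x + N^x), where N is the league average R/G."""
    return rg ** exponent / (rg ** exponent + league_rg ** exponent)


# Made-up hitter: 100 runs created in 400 batting outs, league N = 4.5.
rg = runs_per_game(100, 400)
print(round(rg, 2), round(offensive_win_pct(rg, 4.5), 3))  # 6.25 0.659
```

By construction, a hitter whose RG equals N comes out at exactly .500.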

OW% shares the property with R/G of familiarity to an average fan--you know that .550 is a contender, .600 is probably one of the best teams in baseball, and that .700 is legendary. Of course, individual performance varies much more than team performance, so you cannot carry those interpretations of the values for teams over to individuals, but they can still serve as guideposts.

Another commonly used method of converting an R/O or R/G figure into another scale is to convert it to a number that looks like a batting average, since all fans are familiar with the BA scale. This is what Clay Davenport does in Equivalent Average (EQA) and it is what Eddie Epstein does with Real Offensive Value (ROV):
EQA = ((R/O)/5)^.4
ROV = .0345 + .1054*sqrt(R/G)

Let’s look at an example of two players who play in a league where the average team scores 4.5 runs/game:
PLAYER   R/G    OW%    EQA    ROV
A        8.00   .760   .333   .333
B        7.00   .708   .316   .313
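For anyone who wants to check the table, here is a rough Python sketch of the three conversions. The assumptions are mine: 25 outs per game to get from R/G back to R/O, and an EQA exponent of 0.4, which is the value that reproduces the .333 and .316 shown for these two players.

```python
import math

OUTS_PER_GAME = 25.0  # assumption: batting outs only
LEAGUE_RG = 4.5       # N for the example league


def ow_pct(rg, n=LEAGUE_RG, x=2.0):
    return rg ** x / (rg ** x + n ** x)


def eqa(rg, outs_per_game=OUTS_PER_GAME):
    # (R/O)/5 raised to the 0.4 power; 0.4 is the exponent that matches the table
    return (rg / outs_per_game / 5.0) ** 0.4


def rov(rg):
    return 0.0345 + 0.1054 * math.sqrt(rg)


for name, rg in (("A", 8.0), ("B", 7.0)):
    print(name, f"{rg:.2f}", f"{ow_pct(rg):.3f}", f"{eqa(rg):.3f}", f"{rov(rg):.3f}")
```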

We can see that the ratio in terms of runs between the players is 8/7 = 1.143. The ratio in OW% is 1.073 and the ratio in EQA is 1.054. So these other measures are decreasing the ratio between players. Now this is not necessarily a bad thing, because our ultimate goal is to move from run-based evaluations to win-based. But in fact, the ratios are incorrect.

Based on Pythagorean, Run Ratio = R/RA, or in our case (R/G)/(Lg R/G). Win Ratio equals Run Ratio^x, and W% = Win Ratio/(Win Ratio + 1). So the Win Ratio for Player A is 3.16 (versus a Run Ratio of 1.78) and for Player B is 2.42 (versus a Run Ratio of 1.56). If we have one player with a run ratio of RR1 and another with a run ratio of RR2, the ratio of their win ratios is (RR1/RR2)^x, so Player A has a win ratio 31% higher than Player B (versus the 14% higher run ratio). So the WR grows as the x-th power of the RR.
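Here is the same arithmetic as a short Python sketch, using the two example players and N = 4.5. The helper names are my own, and the exponent is left at the usual 2.

```python
def win_ratio(rg, league_rg, exponent=2.0):
    """Win Ratio = Run Ratio^x, where Run Ratio = RG / N."""
    return (rg / league_rg) ** exponent


def win_pct_from_ratio(wr):
    """W% = Win Ratio / (Win Ratio + 1)."""
    return wr / (wr + 1.0)


N = 4.5
wr_a = win_ratio(8.0, N)
wr_b = win_ratio(7.0, N)
print(round(wr_a, 2), round(wr_b, 2))        # 3.16 2.42
print(round(wr_a / wr_b, 3))                 # 1.306, i.e. (8/7)^2
print(round(win_pct_from_ratio(wr_a), 3))    # 0.76, the same as Player A's OW%
```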

So basically, OW%, EQA, and other similar methods reduce the ratios between players and therefore distort the scale. The number you look at may be on a more familiar scale, but the scale distortion may cause confusion. While Davenport and Epstein and others are free to state the end results of their methods however they’d like, I think that the best course of action is to use the R/O or R/G or whatever scale and learn the standards. We all agree that BA is not a useful measure for a player’s total value, so why continue to use that scale? If R/O is the proper measure, let’s learn the scale.

James’s system went on to express a player’s contribution in terms of a number of offensive wins and losses. Since we already have an offensive winning percentage, all we need to find offensive wins and losses is offensive games. By definition in the OW% formula, Games = Outs/25 (or whatever value is appropriate given the type of outs that are being considered). So:
Offensive Wins = OW%*Offensive Games
Offensive Losses = (1 - OW%)*Offensive Games = Offensive Games - Offensive Wins
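As a sketch of how the pieces chain together, the function below goes from runs created and outs to an offensive won-lost record. The function name, the 25 outs/game default and the sample line (100 runs created, 400 outs, N = 4.5) are my own choices for illustration.

```python
def offensive_record(runs_created, outs, league_rg, outs_per_game=25.0, exponent=2.0):
    """Return (offensive wins, offensive losses) under the OW% approach."""
    games = outs / outs_per_game              # Offensive Games = Outs / 25
    rg = runs_created / games                 # R/G
    ow_pct = rg ** exponent / (rg ** exponent + league_rg ** exponent)
    wins = ow_pct * games
    return wins, games - wins


wins, losses = offensive_record(100, 400, 4.5)
print(f"{wins:.2f}-{losses:.2f}")  # 10.54-5.46
```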

So now we can apparently express a player’s offensive contribution in terms of wins and losses. Great. But we still have a problem. How do we compare two players with different amounts of playing time? Consider two players:
NAME   RC    O     RG     OW%    OW-OL
A      100   400   6.25   .659   10.54-5.46
B      88    300   7.33   .726   8.71-3.29
I have assumed that N = 4.5 for both players. Player A has more OW, but he also has more OL and an OW% that is almost seventy points lower. How do we pick between these two, value-wise? Clearly, Player B’s rate of performance is superior. But Player A’s total offensive wins is higher by almost two. It’s not an obvious choice.

Well, you could say, Player A has 5.08 more wins than losses, while Player B has 5.42, so Player B is better. Alrighty then. See what you just did? You put in a baseline. Your baseline was “over .500, times two”. So you are comparing a player to an average player.

Many sabermetricians think that a better comparison is to compare each player to a “replacement” player. This debate is beyond the scope of this article, but let’s just say the replacement player would have an OW% of .350. What if we compare to .350? Well, Player A has 16 offensive games, so a replacement player would be 5.6-10.4 in those games, so our guy is +4.94 wins over him. (This is figured as (OW% - .350)*O/25, where O/25 is the number of offensive games.) Player B has 12 offensive games, so he is +4.51 wins.
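A quick Python sketch of the two baselines, using the rounded OW% figures for Players A and B. The function name is mine, and .350 is just the replacement level assumed in the text, not an endorsed value.

```python
def wins_above(ow_pct, outs, baseline, outs_per_game=25.0):
    """Wins above a baseline W% over the player's offensive games (outs / 25)."""
    return (ow_pct - baseline) * outs / outs_per_game


players = {"A": (0.659, 400), "B": (0.726, 300)}  # (OW%, outs)

for baseline in (0.500, 0.350):
    line = ", ".join(
        f"{name} {wins_above(ow, outs, baseline):+.2f}"
        for name, (ow, outs) in players.items()
    )
    print(f"vs {baseline:.3f}: {line}")
```

Against .500 Player B comes out ahead; against .350 the order reverses. Wins above average is half of the wins-minus-losses figure, which is why Player A shows up as about +2.54 here rather than 5.08.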

So if we compare to average, Player B is ahead, but if we compare to replacement, Player A is ahead. And this is very common; different baselines change rankings. This is true no matter what rate stat you start out with. So what’s the point, as it applies to OW%? The point is that having a number that is supposed to be “wins” and “losses” as the OW% system does, as opposed to just having a number of wins above some baseline, as other systems do, is not a panacea. Even if you have absolute wins and losses, you are going to have to use some sort of baseline to sort out who’s better than who. And the other systems can be adapted to other baselines (we’ll talk more about this in the next segment), so the absolute wins and losses aren’t really an advantage of this system.

Moving on from that, let’s use the OW% approach to compare two real player seasons:
NAME       RC    O     RG      OW%    OW-OL
Mantle     162   301   13.46   .910   10.96-1.08
Williams   161   257   15.66   .932   9.58-.70
These are probably two of the three or four best offensive seasons in the major leagues in the 1950s, both turned in during the 1957 AL season (N = 4.23) by Mickey Mantle and Ted Williams. We see that Mantle and Williams created almost identical numbers of runs, but that Mantle made 44 more outs. This gives Williams a comfortable edge in RG (about 16% higher), and a smaller but still significant lead in OW% due to the scale distortion (about 2% higher). So Williams has created just about as many runs as Mantle, and used far fewer outs. So clearly, it seems, Williams should rate ahead of Mantle.

However, Mantle has 1.38 more offensive wins. On the other hand, he has .38 more offensive losses. If we compare them to a .500 player, Mantle is +4.94 wins and Williams is +4.44 wins (figured as (OW% - .5)*O/25). How can this be? How can we rate Mantle ahead of Williams, by half a win, when there is essentially no difference between the number of runs they created but a large difference in the number of outs they made?
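The Mantle and Williams figures can be reproduced with a short Python sketch. The inputs are the RC and out totals from the table and N = 4.23; the function name and the 25 outs/game default are my own.

```python
def ow_line(runs_created, outs, league_rg, outs_per_game=25.0, x=2.0):
    """Return (RG, OW%, offensive wins, offensive losses, wins above .500)."""
    games = outs / outs_per_game
    rg = runs_created / games
    ow = rg ** x / (rg ** x + league_rg ** x)
    wins = ow * games
    return rg, ow, wins, games - wins, (ow - 0.5) * games


N = 4.23  # 1957 AL
for name, rc, outs in (("Mantle", 162, 301), ("Williams", 161, 257)):
    rg, ow, w, l, waa = ow_line(rc, outs, N)
    print(f"{name}: RG {rg:.2f}, OW% {ow:.3f}, {w:.2f}-{l:.2f}, WAA {waa:+.2f}")
```

Mantle comes out about half a win ahead despite the nearly identical runs created and the 44 extra outs.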

In truth, it can’t be. It’s clearly wrong, and is caused by a flaw in the OW% way of thinking. OW% decreases the value of each additional run created. If you have a player with a 4.5 RG in a N = 4.5 league, and you add .5 runs/game and give him a 5 RG, his OW% increases by 52 points, from .500 to .552. If you add another .5 to take him up to 5.5, his OW% increases by 47 points, from .552 to .599. So the additional .5 runs had less win value, according to OW%.
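The shrinking returns are easy to see in a small Python sketch (N = 4.5, exponent 2; the function name is mine):

```python
def ow_pct(rg, n=4.5, x=2.0):
    return rg ** x / (rg ** x + n ** x)


levels = (4.5, 5.0, 5.5)
for lo, hi in zip(levels, levels[1:]):
    gain = ow_pct(hi) - ow_pct(lo)
    print(f"{lo} -> {hi}: {ow_pct(lo):.3f} -> {ow_pct(hi):.3f} (+{gain * 1000:.0f} points)")
```

The first half-run is worth about 52 points of OW% and the second only about 47.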

Like many things, there is a grain of truth to this. It is true that for a team, going from 4.5 runs to 5 will cause a greater increase in W% than going from 5 to 5.5. But there is a big difference between a team adding .5 runs per game and an individual player, who is one-ninth of a team and bats one-ninth of the time, adding .5 runs per game. Each additional run created by a player will in theory have less value than the previous one, but treating the player as a team blows this way out of proportion.

The more fundamental reason why OW% gives this clearly incorrect result is how it defines games. You get credit for more games as you make more outs. Williams’ OW% is .022 higher than Mantle’s, but that is not enough to offset the fact that we are now crediting Mantle with 1.76 more games than Williams. What would have happened if Mantle had made 350 outs? Well, his RG would have gone down to 11.57, and his OW% to .882, but his OW-OL would have been 12.35-1.65 for +5.35 WAA. In fact, we would have to increase Mantle’s outs to 465(!!) before his WAA would reach its potential peak, +5.75. That would be a player with an OW% of .809, almost exactly one hundred points lower than Mantle actually was. And all he’s done to have his value increase is make 164 more outs! Clearly, this approach does not and cannot work.
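To see where the peak sits, here is a brute-force Python sketch that holds Mantle’s 162 runs created fixed and sweeps the out total. The search range of 301 to 1000 outs and the function name are arbitrary choices of mine; N = 4.23 and 25 outs/game are as above.

```python
def waa(runs_created, outs, league_rg, outs_per_game=25.0, x=2.0):
    """Wins above .500 under the OW% approach: (OW% - .5) * outs / 25."""
    games = outs / outs_per_game
    rg = runs_created / games
    ow = rg ** x / (rg ** x + league_rg ** x)
    return (ow - 0.5) * games


RC, N = 162, 4.23
for outs in (301, 350, 465):
    print(outs, round(waa(RC, outs, N), 2))

# Hold runs created fixed and search for the out total that maximizes WAA.
best_outs = max(range(301, 1001), key=lambda o: waa(RC, o, N))
print(best_outs, round(waa(RC, best_outs, N), 2))
```

The measure keeps rewarding extra outs, with no extra runs, all the way up to 465.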

Again, a player is not a team. If we know that one team has made 100 outs, we know that they have played about 4 games. But this is because a team gets 25 outs/game (yes, I know, it’s 27, but we’re using 25 as described above when using just batting outs). A player does not get 25 outs per game. He gets to keep making outs until his team has recorded 25 outs. Then he’s done, whether he has made zero outs (if he has a 1.000 OBA) or all 25 outs (if his eight teammates have a 1.000 OBA). It is unrealistic and silly to state a player’s games in terms of outs.

Of course, even a team does not get one game per 25 outs. The number of outs does not define the number of games a team plays. The number of games defines the number of outs they get. Baseball teams play 162 games a year because somebody in the commissioner’s office said they should play 162 games a year, not because somebody in the commissioner’s office said they should make 4,374 outs a year.

So why not use plate appearances, or at bats, or some combination of outs and those? Because this whole exercise is folly. No matter what we do with the number of games, we have defined OW% in terms of games defined by outs. The bottom line is that players don’t play games themselves. We certainly want to move beyond a player’s run contribution and express his win contribution. But to do this, we need to consider how he would affect the wins of his team, not try to create some situation in which he is a team. A team of Mantles will win 91% of their games. Great. We have one Mantle, on a team with eight other guys who aren’t Mantles. We don’t care how many games nine Mantles will win.

This whole folly began when we expressed the player’s performance in terms of runs/out. From there, it was easy to ask, how many runs would he score per game? From there, it was easy to ask, what would his W% be? From there, it was easy to ask, how many games would he win or lose? From there, it was easy to ask, how many games would he win compared to some baseline? And we ended up with absurd conclusions, that any sabermetric hater would see and laugh at, and rightfully so. Runs per out is a team measure. It is all right to apply it to players; that may not be theoretically correct, but it will not cause too much distortion. But if you start jumping from just using it as a rate to doing all sorts of other stuff with it, then you will get distortion.

On that note, let me clarify something from earlier. I said that a class 1 run estimator, which measures multiplicative effects, like RC, should be done in terms of R/O. This could be seen as contradicting what I just said. The point is that in fact, a player’s RC is not an accurate representation of his runs contributed to his team, but the people who use it treat it as such. Nobody actually sets out to apply a full class 1 approach to a player, because they recognize that a player is not a team. What they do is apply a class 1 run estimator without realizing that it is incompatible with a full-blown class 2 or 3 evaluation approach.