Monday, June 15, 2020

Tripod: Baselines

See the first paragraph of this post for an explanation of this series.

This essay touches on various baselines and which are appropriate (in my opinion) for what you are trying to measure. In other words, it discusses things like replacement level. This is a topic that creates a lot of debate and acrimony among sabermetricians. A lot of this has to do with semantics, so all that follows is my opinion, some of it backed by facts and some of it not.

Again, I cannot stress this enough: different baselines for different questions. When you want to know which baseline to use, first ask yourself: what am I trying to measure?

Anyway, this discussion is kind of disjointed, so I'll just put up a heading for a topic and write on it.

Individual Winning Percentage

Usually the baseline is discussed in terms of a winning percentage. This unfortunate practice stems from Bill James' Offensive Winning Percentage. What is OW%? If Jason Giambi creates 11 runs per game in a context where the average team scores 5 runs per game, then Giambi's OW% is the W% you would expect from a team that scores 11 runs per game and allows 5 (.829 when using a Pythagorean exponent of 2). It is important to note that OW% assumes that the team has average defense.

So people will refer to a replacement level of, say, .333, and what they mean is that the player's net value should be calculated as the number of runs or wins he created above what a .333 player would have done. This gets very confusing when people try to frame the discussion of what the replacement level should be in terms of actual team W%s. They'll say something like, "the bottom 1% of teams have an average W% of .300, so let's make .300 the replacement level." That's fine, but the .300 team got its record from both its offense and its defense. If the team had an OW% of .300 and a corresponding DW% of .300, its record would be about .155.
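
To make the arithmetic above concrete, here is a minimal Python sketch of both calculations, using the fixed Pythagorean exponent of 2 assumed in these examples: OW% from a player's runs created per game and the league scoring level, and the combined W% for a team whose offense and defense are each .300 (the win ratios multiply because each is a squared run ratio under exponent 2).

```python
def pyth_w_pct(runs_for, runs_against, exponent=2.0):
    """Pythagorean W% for a runs scored/allowed pair."""
    return runs_for ** exponent / (runs_for ** exponent + runs_against ** exponent)

# OW%: a hitter creating 11 runs per game in a 5 R/G league, average defense assumed
ow_pct = pyth_w_pct(11, 5)  # about .829

def combine_ow_dw(ow, dw):
    """Combine an OW% and a DW% by multiplying win ratios, then converting back to a W%."""
    win_ratio = (ow / (1 - ow)) * (dw / (1 - dw))
    return win_ratio / (win_ratio + 1)

team_w_pct = combine_ow_dw(0.300, 0.300)  # about .155

print(round(ow_pct, 3), round(team_w_pct, 3))
```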

Confusing, eh? And part of that comes from the silly idea of putting a player's individual performance in the form of a team's W%. So I prefer to define replacement level in terms of the percentage of the league average at which the player performed. It is much easier to deal with, and it just makes more sense. But I may use both interchangeably here, since most people discuss this in terms of W%.

ADDED 12/04: You can safely skip this part and still understand the rest of the article; it's really about a different subject anyway. I should note the weakness of the % of league approach. The impact of performing at 120% of the league is different at different levels of run scoring. The reason for this is that the % of league for a particular player is essentially a run ratio (like runs scored/runs allowed for a team). We are saying that said player creates 20% more runs than his counterpart, which we then translate into a W% by the Pythagorean: 1.2^2/(1.2^2+1) = .590. But as you can read in the "W% Estimator" article, the ideal exponent varies based on RPG. In a 10 RPG context (fairly normal), the ideal exponent is around 1.95. But in an 8 RPG context, it is around 1.83. So in the first case a 1.2 run ratio gives a .588 W%, but in the other it gives .583. Now this is a fairly minor factor in most cases, but we want to be as precise as possible.

So from this you might determine that indeed the W% display method is ideal, but the W% approach serves to ruin the proportional relationship between various run ratios (with a Pyth exponent of 2, a 2 RR gives an .800 W% while a 1 RR gives .500, but 2 is twice as high as 1, not .8/.5). So the ideal thing as far as I'm concerned is to use the % of league, but translate it into a win ratio by raising it to the proper Pythagorean exponent for the context (which can be figured approximately as RPG^.28). But this shouldn't have too big of an impact on the replacement level front. If you like the win ratio idea but want to convert it back into a run ratio, you can pick a "standard" league that you want to translate everybody back into (a la Clay Davenport). So if you want a league with a Pyth exponent of 2, take the square root of the win ratio to get the run ratio. Generally, (W/L) = (R/RA)^x, or (R/RA) = (W/L)^(1/x).
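
A short sketch of the conversion just described. The exponent approximation is the one quoted in the text (the essay quotes RPG^.28 here and RPG^.29 later; .29 reproduces the 1.95 and 1.83 exponents in the preceding paragraph, so it is used below), and the final function is the back-translation into a "standard" league with a Pythagorean exponent of 2.

```python
def pyth_exponent(rpg, k=0.29):
    """Approximate ideal Pythagorean exponent for a run environment (RPG = total runs per game)."""
    return rpg ** k

def pct_of_league_to_w_pct(run_ratio, rpg):
    """Treat % of league as a run ratio and convert it to a W% for the given run environment."""
    x = pyth_exponent(rpg)
    win_ratio = run_ratio ** x
    return win_ratio / (win_ratio + 1)

print(round(pct_of_league_to_w_pct(1.2, 10), 3))  # about .588
print(round(pct_of_league_to_w_pct(1.2, 8), 3))   # about .583

def to_standard_run_ratio(run_ratio, rpg, target_exponent=2.0):
    """Translate the context-specific win ratio back to a run ratio in a league with the target exponent."""
    win_ratio = run_ratio ** pyth_exponent(rpg)
    return win_ratio ** (1 / target_exponent)
```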

Absolute Value

This is a good place to start. Why do we need a baseline in the first place? Why can't we just look at a player's Runs Created, and be done with it? Sabermetricians, I apologize, this will be quite patronizing for you.

Well, let's start by looking at a couple of players:

          H    D    T   HR    W
Player A  145  17   1   19   13
Player B  128  32   2   19   25

The first guy has 68 RC, the second guy has 69. But when you discover that Player A made 338 outs and Player B made 284 outs, the choice becomes pretty clear, no? BTW, Player A is Randall Simon and Player B is Ivan Rodriguez (2003).

But you could say that we should have known Player B was better, because we could just look at his runs/out. Of course, I could give you an example of two guys at .2 runs/out, where one made 100 outs and had 20 RC and the other made 500 outs and had 100 RC. And so you see that there must be some kind of balance between the total and the rate.

The common sense way to do this is with a baseline. Some people, like a certain infamous SABR-L poster, will go to extreme lengths to attempt to combine the total and the rate in one number, using all sorts of illogical devices. A baseline is logical. It kills two or three birds with one stone. For one thing, we can incorporate both the total production and the rate of production. For another, we eventually want to evaluate the player against some sort of standard, and that standard can be the baseline that we use. And using a baseline automatically inserts an adjustment for league context.
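
As a bare-bones illustration of how a baseline folds the total and the rate into one number, here is the usual runs-above-baseline calculation applied to the two hypothetical .2 runs/out players from the previous paragraph. The league rate of .18 runs/out and the 73% (roughly .350) baseline are assumptions for the example only.

```python
def runs_above_baseline(rc, outs, lg_runs_per_out, baseline_pct):
    """Runs above a baseline expressed as a percentage of the league runs-per-out rate."""
    return (rc / outs - baseline_pct * lg_runs_per_out) * outs

print(round(runs_above_baseline(20, 100, 0.18, 0.73), 1))   # about +6.9 runs
print(round(runs_above_baseline(100, 500, 0.18, 0.73), 1))  # about +34.3 runs
```

Both players have the same rate, but the one who sustained it over five times the playing time comes out roughly five times more valuable.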

There is value in every positive act done on a major league field. There is no way that you can provide negative absolute value. If you bat 500 times, make 499 outs and draw 1 walk, you have still contributed SOMETHING to your team. You have provided some value to that team.

But on the other hand, the team could have easily, for the minimum salary, found someone who could contribute far, far more than you could. So you have no value to the team in an economic sense. The team has no reason to pay you a cent, because they can find someone who can put up a .000/.002/.000 line panhandling on the street. This extreme example just goes to show why evaluating a major league player by the total amount of production he has put up is silly. That leads into the question: what is the level at which a team can easily find a player who can play that well?

Minimum Level

This is where a lot of analysts like to draw the baseline. They will find the level at which there are dozens of available AAA players who perform that well, and that is the line against which they evaluate players. Those players are numerous and therefore have no real value to a team. They can call up another one from AAA, or find one on waivers, or sign one out of the Atlantic League. Whatever.

There are a number of different ways of describing this, though. One is the "Freely Available Talent" level. That's sort of the economic argument I spelled out. But is it really free? This might be nitpicking, but I think it is important to remember that all teams spend a great deal of money on player development. If you give your first round pick a $2 million bonus and he turns out to be a "FAT" player, he wasn't really free. Of course, he is freely available to whoever might want to take him off your hands. But I like the analogy of, say, getting together with your friends, throwing your car keys in a box, and then picking one randomly and taking that car. If you put your Chevy Metro keys in there and draw out somebody's Ford Festiva keys, you didn't get anywhere. And while you now have the Festiva, it wasn't free. This is exactly what major league teams do when they pick each other's junk up. They have all poured money into developing the talent and have given up something to acquire it (namely, their junk). None of this changes the fact that the talent is freely available, or really provides any evidence against the FAT position at all; I just think it is important to remember that the talent may be free now, but it wasn't free before. Someone on FanHome proposed replacing FAT with Readily Available Talent or something like that, which makes some sense.

Another way people define this is the level at which a player can stay on a major league 25 man roster. There are many similar ways to describe it, and while there might be slight differences, they all are getting at the same underlying principle.

The most extensive study to establish what this line is was undertaken by Keith Woolner in the 2002 Baseball Prospectus. He determined that the minimum level was about equal to 80% of the league average, or approximately a .390 player. He, however, looked at all non-starters, producing a mishmash of bench players and true FAT players.

The basic idea behind all of these is that if a player fell off the face of the earth, his team would have to replace him, and the player who would replace him would be one of these readily available players. So it makes sense to compare the player to the player who would replace him in case of injury or other calamity.

A W% figure that is often associated with this line of reasoning is .350, although obviously there is no true answer and various other figures might give a better representation. But .350 has been established as a standard by methods like Equivalent Runs and Extrapolated Wins, and it is doubtful that it will be going anywhere any time soon.

Sustenance Level

This is kind of similar to the above. This is the idea that there is some minimum level of performance below which the team will no longer tolerate the player, and will replace him. This could apply either to his spot on the roster or to his job as a starting player (obviously, the second will produce a higher baseline in theory). You could also call this the "minimum sustainable performance" level.

Cliff Blau attempted a study to see when regular players lost their jobs based on their RG, at each position. While I have some issues with Blau's study, such as that it did not include league adjustments while covering some fairly different offensive contexts, his results are interesting nonetheless. He found no black line, no one level where teams threw in the towel. This really isn't that surprising, as there are a number of factors involved in whether or not a player keeps his job other than his offensive production (such as salary, previous production, potential, defensive contribution, nepotism, etc.). But Bill James wrote in the 1985 Abstract that he expected there would be such a point. He was wrong, but we're all allowed to be sometimes.

Anyway, this idea makes sense. But a problem with it is that it is hard to pin down exactly where this line is, or for that matter, where the FAT line is. We don't have knowledge of a player's true ability, just a sample of varying size. The team might make decisions on who to replace based on a non-representative sample, or the sabermetrician might misjudge the talent of players in his study and thus misjudge the talent level. There are all sorts of selective sampling issues here. We also know that the major leagues are not comprised of the 750 best players in professional baseball. Maybe Rickie Weeks could hit better right now than the Brewers' utility infielder, but they want him to play every day in the minors. The point is, it is impossible to draw a firm baseline here. All of the approaches involve guesswork, as they must.

Some people have said we should define replacement level as the W% of the worst team in the league. Others have said it should be based on the worst teams in baseball over a period of years. Or maybe we should take out all of the starting players from the league and see what the performance level of the rest of them is. Any way you do it, there's uncertainty, large potential for error, and a need to remember that there's no firm line.

Average

But the uncertainty of the FAT or RAT or whatever baseline does leave people looking for something that is defined, and that is constant. And average fits that bill. The average player in the league always performs at a .500 level. The average team always has a .500 W%. So why not evaluate players based on their performance above what an average player would have done?

There are some points that can be made in favor of this approach. For one thing, the opponent that you play is, on average, a .500 opponent. If you are above .500, you will win more often than you lose. If you are below .500, you will lose more often than you win. The argument that a .500+ player is doing more to help his team win than his opponent is, while the .500- player is doing less to help his team win than his opponent is, makes for a very natural demarcation: win vs. loss.

Furthermore, the .500 approach is inherently built into any method of evaluating players that relies on Run Expectancy or Win Expectancy, such as empirical Linear Weights formulas. If you calculate the run value of each event as the final RE value minus the initial RE value, plus runs scored on the play (which is what empirical LW methods and the value added approach are doing), the average player will wind up at zero. Now the comparison to zero is not inevitable; you can fudge the formula or the results to compare to a non-.500 baseline, but initially the method is comparing to average.
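
A hedged sketch of the run expectancy bookkeeping described here. The three states and their RE values are made up for illustration (they are not an actual RE matrix); the point is only that an event's value is the final RE minus the initial RE plus runs scored on the play, which is what centers the average player on zero.

```python
# Made-up run expectancy fragment: (base state, outs) -> expected runs to end of inning
RE = {
    ("empty", 0): 0.50,
    ("1st", 0): 0.90,
    ("empty", 1): 0.27,
}

def event_run_value(start_state, end_state, runs_scored):
    """Empirical linear weights value of one event: RE after, minus RE before, plus runs on the play."""
    return RE[end_state] - RE[start_state] + runs_scored

print(round(event_run_value(("empty", 0), ("1st", 0), 0), 2))    # leadoff walk: about +0.40 runs
print(round(event_run_value(("empty", 0), ("empty", 1), 0), 2))  # leadoff out: about -0.23 runs
```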

An argument that I have made on behalf of the average baseline is that, when looking back in hindsight on the season, the only thing that ultimately matters is whether or not the player helped you win more than your opponent. An opponent of the average baseline might look at a .510 player with 50 PA and say that he is less valuable than a .490 player with 500 PA, since the team still had to fill 450 additional PA when it used the first player. This is related to the "replacement paradox," which I will discuss later, but ignoring that issue for now, my argument back would be that it is really irrelevant, because the 450 PA were filled by someone, and there's no use crying over spilled milk. The .490 player still did less to help his team win than his opponent did to help his team win. It seems as if the minimum level is more of a forward-looking thing, saying "If a team could choose between two players with these profiles, they would take the second one," which is surely true. But the fact remains that the first player contributed to wins more than his opponent did. From a value perspective, I don't necessarily have to care about what might have happened; I can just focus on what did happen. It is similar to the debate about whether to use clutch hitting stats, or actual pitcher $H data, even when we know that these traits are not strongly repeatable from season to season. Many people, arguing for a literal value approach, will say that we should use actual hits allowed or a player's actual value added runs, but will insist on comparing the player to his hypothetical replacement. This is not a cut-and-dried issue, but it reminds us of why it is so important to clearly define what we are trying to measure and let the definition lead us to the methodology.

Average is also a comfortable baseline for some people to use because it is a very natural one. Everybody knows what an average is, and it is easy to determine what an average player's Batting Average or walk rate should be. Using a non-.500 baseline, some of this inherent sense is lost and it is not so easy to determine how a .350 player for instance should perform.

Finally, the most readily accessible player evaluation method, at least until recently, was Pete Palmer's Linear Weights system. In the catch-all stat of the system, Total Player Rating, he used an average baseline. I have heard some people say that in the later editions of Total Baseball he justified it on the grounds that if you didn't go .500, you couldn't make the playoffs. However, in the final published edition, on page 540, he lays out a case for average. I will quote it extensively here since not many people have access to the book:

The translation from the various performance statistics into the wins or losses of TPR is accomplished by comparing each player to an average player at his position for that season in that league. While the use of the average player as the baseline in computing TPR may not seem intuitive to everyone, it is the best way to tell who is helping his team win games and who is costing his team wins. If a player is no better than his average counterparts on other teams, he is by definition not conferring any advantage on his team. Thus, while he may help his team win some individual games during the season--just as he will also help lose some individual games--over the course of a season or of a career, he isn't helping as much as his opponents are. Ultimately, a team full of worse-than-average players will lose more games than it wins.

The reason for using average performance as the standard is that it gives a truer picture of whether a player is helping or hurting his team. After all, almost every regular player is better than his replacement, and the members of the pool of replacement players available to a team are generally a lot worse than average regulars, for obvious reasons.

If Barry Bonds or Pedro Martinez is out of the lineup, the Giants or the Red Sox clearly don't have their equal waiting to substitute. The same is typically true for lesser mortals: when an average ballplayer cannot play, his team is not likely to have an average big-league regular sitting on the bench, ready to take his place.

Choosing replacement-level performance as the baseline for measuring TPR would not be unreasonable, but it wouldn't give a clear picture of how the contributions of each player translate into wins or losses. Compared to replacement-level performance, all regulars would look like winners. Similarly, when compared to a group of their peers, many reserve players would have positive values, even though they would still be losing games for their teams. Only the worst reserves would have negative values if replacement level were chosen as the baseline.

The crux of the problem is that a team composed of replacement-level players (which would by definition be neither plus nor minus in the aggregate if replacement-level is the baseline) would lose the great majority of its games! A team of players who were somewhat better than replacement level--but still worse than their corresponding average regulars--would lose more games than it won, even though the player values (compared to a replacement-level baseline) would all be positive.

Median

This is sort of related to the average school of thought. But these people will say that since the talent distribution in baseball is something like the far right-hand portion of a bell curve, there are more below average players than above average players, but the superior performance of the above average players skews the mean. The average player may perform at .500, but if you were given the opportunity to take the #15 or #16 first baseman in baseball, he would actually be slightly below .500. So they would suggest that you cannot fault a player for being below average if he is in the top half of players in the game.

It makes some sense, but for one thing, the median in Major League Baseball is really not that dissimilar to the mean. A small study I did suggested that the median player performs at about 96% of the league mean in terms of run creation (approx. .480 in W% terms). It's almost a negligible difference. Maybe it is farther from the mean than that (as other studies have suggested), but either way, it just does not seem to me to be a worthwhile distinction, and most sabermetricians are sympathetic to the minimum baseline anyway, so few of them would be interested in a median baseline that really is not much different from the mean.

Progressive Minimum

The progressive minimum school of thought was first expressed by Rob Wood, while trying to reconcile the average position and the minimum position, and was later suggested independently by Tango Tiger and Nate Silver as well. This camp holds that if a player is injured and the team must scramble to find a .350 replacement, that does not bind them to using the .350 replacement forever. A true minimal level supporter wants us to compare Pete Rose, over his whole 20+ year career, to the player who would have replaced him had he been hurt at some point during that career. But if Pete Rose had been abducted by aliens in 1965, would the Reds have still been forced to use a .350 player in 1977? No. The team would either make a trade or free agent signing to improve, or the .350 player would get better and save his job, or a prospect would eventually come along to replace him.

Now the minimum level backer might object, saying that if you have to use resources to acquire a replacement, you are sacrificing potential improvement in other areas. This may be true to some extent, but every team at some point must sacrifice resources to improve themselves. It is not as if you can run a franchise solely on other people's trash. A team that tried to do this would eventually have no fans and would probably be repossessed by MLB. Even the Expos do not do this; they put money into their farm system, and it produced players like Guerrero and Vidro. They produced DeShields who they turned into Pedro Martinez. Every team has to make some moves to improve, so advocates of the progressive or time dependent baseline will say that it is silly to value a player based on an unrealistic representation of the way teams actually operate.

So how do we know how fast a team will improve from the original .350 replacement? Rob Wood and Tango looked at it from the perspective of an expansion team. Expansion teams start pretty much with freely available talent, but on average, they reach .500 in 8 years. So Tango developed a model to estimate the W% of such a team in years 1, 2, 3, etc. A player who plays for one year would be compared to .350, but his second year might be compared to .365, etc. The theory goes that the longer a player is around, the more chances his team has had to replace him with a better player. Eventually, a team will come up with a .500 player. After all, the average team, expending an average amount of resources, puts out a .500 team.
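
Neither Wood's nor Tango's actual model is reproduced here, so the following is purely an illustrative placeholder: a straight-line climb from the FAT level toward .500 over the roughly eight seasons an expansion team takes to reach average. It only shows the shape of the idea, not the real curve.

```python
def progressive_baseline(year, start=0.350, target=0.500, years_to_average=8):
    """Illustrative placeholder ONLY (not Wood's or Tango's model): a linear climb
    from the FAT level in year 1 to .500 by the eighth year."""
    if year >= years_to_average:
        return target
    return start + (target - start) * (year - 1) / (years_to_average - 1)

for y in range(1, 9):
    print(y, round(progressive_baseline(y), 3))
# year 1 -> .350, year 2 -> about .371, ..., year 8 -> .500
```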

Another area you could go to from here is whether or not the baseline should ever rise above .500. This is something that I personally am very uneasy with, since I feel that any player who contributes more to winning than his opponent does should be given a positive number. But you could make the case that if a player plays for 15 years in the show, at some point he should have provided above average performance. This approach would lead to a curve for a player's career that would rise from .350, up over .500 maybe to say .550 at its peak, and then tailing back down to .350. Certainly an intriguing concept.

Silver went at it differently, looking at players' offensive performance charted against their career PA. It made a logarithmic curve, and he fitted a line to it. As PA increase, offensive production rapidly increases, but then the curve flattens out. Comparing Silver's work to Tango's, the baselines at various years were similar. It was encouraging to see similar results coming from two totally different and independent approaches.

A common argument against the progressive baseline is that even if you can eventually develop a .500 replacement, the presence of your current player does not inhibit the development of the replacement, so if your player does not get hurt or disappear, you could peddle the replacement to shore up another area, or use him as a backup, or something else. This is a good argument, but my counter might be that it is not just at one position where you will eventually develop average players; it is all over the diamond. The entire team is trending toward the mean(.500) at any given time, be it from .600 or from .320. Another potential counter to that argument is that some players can be acquired as free agent signings. Of course, these use up resources as well, just not human resources.

The best argument that I have seen against the progressive level is that if a team had a new .540 first baseman every year for 20 years, each would be evaluated against a .350 first baseman. But if a team had the same .540 first baseman for 20 years, he would be evaluated against a .350, then a .365, then a .385, etc., and would be rated as having less value than the total of the other team's 20 players, even though each team got the exact same amount of production out of its first base position. However, this just shows that the progressive approach might not make sense from a team perspective, but does make sense from the perspective of an individual player's career. Depending on what we want to measure, we can use different baselines.

Chaining

This is the faction that I am most at home in, possibly because I published this idea on FanHome. I borrowed the term "chaining" from Brock Hanke. Writing on the topic of replacement level in the 1998 BBBA, he said something to the effect that when you lose your first baseman, you don't just lose him. You lose your best pinch hitter, who now has to man first base, and then he is replaced by some bum.

But this got me to thinking: if the team is replacing the first baseman with its top pinch hitter, who must be a better-than-minimum player or else he could easily be replaced himself, why must we compare the first baseman to the .350 player who now pinch hits? The pinch hitter might get 100 PA, but the first baseman gets 500 PA. So the actual effect on the team when the first baseman is lost is not that it gives 500 PA to a .350 player; instead, it gives 500 PA to the .430 pinch hitter and 100 PA to the .350 player. And all of that dynamic is directly attributable to the first baseman himself. The actual baseline in that case should be something like .415.

The fundamental argument to back this up is that the player should be evaluated against the full scenario that would occur if he had to be replaced, not just the guy who takes his roster spot. Let's run through an example of chaining, with some numbers. Let's say that we have our starting first baseman, who we'll call Ryan Klesko. We'll say Klesko has 550 PA, making 330 outs, and creates 110 runs. His backup racks up 100 PA, makes 65 outs, and creates 11 runs. Then we have a AAA player who will post a .310 OBA and create .135 runs/out, all in a league where the average is .18 runs/out. That makes Klesko a .775 player, his backup a .470 player, and the AAA guy a .360 player (ignoring defensive value and the fact that these guys are first basemen for the sake of example; we're also ignoring the effect of the individuals' OBAs on their PA totals below--the effect might be slight, but it is real and would serve to decrease the performance of the non-Klesko team). Now in this league, the FAT level is .135 R/O. So a minimalist would say that Klesko's value is (110/330 - .135)*330 = +65.5 RAR. Or, alternatively, if the AAA player had taken Klesko's 550 PA (and it is the same thing as doing the RAR calculation), he would have made 380 outs and created 51 runs.

Anyway, when Klesko and his backup are healthy, the team's first basemen have 650 PA, 395 outs, and 121 RC. But what happens if Klesko misses the season? His 550 PA will not go directly to the bum. The backup will assume Klesko's role and the bum will assume the backup's. So the backup will now make 550/100*65 = 358 outs and create 11/65*358 = 61 runs. The bum will now bat 100 times, make 69 outs, and create .135*69 = 9 runs. So the team now has 427 outs and 70 RC from its first basemen. We lose 51 runs and gain 32 outs. But in the first scenario, with the bum replacing Klesko directly (which is what a calculation against the FAT line implicitly assumes), the team total would be 445 outs and 62 runs created. So the chaining subtracts 18 outs and adds 8 runs. Klesko's real replacement is the 70/427 scenario. That is .164 runs/out, or 91% of the league average, or a .450 player. That is Klesko's true replacement: a .450 player. A big difference from the .360 player the minimalists would assume.
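
Since the chaining arithmetic is easy to lose track of, here is a sketch that reproduces the Klesko example: the backup inherits the starter's playing time at his own rates, the FAT player inherits the backup's, and the composite of the two is what the starter is actually being compared against.

```python
LG_R_PER_OUT = 0.18

# Starter ("Klesko"), his backup, and the freely available AAA player
starter = {"pa": 550, "outs": 330, "rc": 110}
backup = {"pa": 100, "outs": 65, "rc": 11}
fat_oba, fat_r_per_out = 0.310, 0.135

# If the starter is lost, the backup absorbs the starter's 550 PA at his own rates,
# and the FAT player takes the backup's 100 PA
b_outs = backup["outs"] * starter["pa"] / backup["pa"]
b_rc = backup["rc"] * starter["pa"] / backup["pa"]
f_outs = backup["pa"] * (1 - fat_oba)
f_rc = fat_r_per_out * f_outs

chain_outs = b_outs + f_outs
chain_rc = b_rc + f_rc
rate = chain_rc / chain_outs
pct_of_lg = rate / LG_R_PER_OUT
w_pct = pct_of_lg ** 2 / (pct_of_lg ** 2 + 1)

print(f"{chain_outs:.0f} outs, {chain_rc:.0f} RC, {rate:.3f} R/O, "
      f"{pct_of_lg:.0%} of league, {w_pct:.3f} W%")
# prints roughly 426 outs, 70 RC, .164 R/O, 91% of league, .453 W% -- i.e. the ~.450
# replacement in the text (the text's 427/70 comes from rounding the backup's line
# to whole numbers first)
```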

But what happens if the backup goes down? Well, he is just replaced directly by the bum, and so his true replacement level is a .360 player. Now people will say that it is unfair for Klesko to be compared to .450 and his backup to be compared to .360. But from the value perspective, that is just the way it is. The replacement level for a starting player is simply higher than the replacement level for a backup. This seems unfair, and it is a legitimate objection to chaining. But I suggest that it isn't that outlandish. For one thing, it looks like the law of diminishing returns. Take the example of a team's runs-to-wins converter. The RPW suggested by the Pythagorean is:

RPW = RD:G/(RR^x/(RR^x+1)-.5)

Where RD:G is run differential per game, RR is run ratio, and x is the exponent. We know that the exponent is approximately equal to RPG^.29. So a team that scores 5 runs per game and allows 4 runs per game has an RPW of 9.62. But what about a team that scores 5.5 and allows 4? Its RPW is 10.11.
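
A quick check of the runs-per-win figures quoted here, plugging the two teams into the formula above with x = RPG^.29:

```python
def runs_per_win(r_per_g, ra_per_g):
    """Pythagorean runs per win: run differential per game divided by the W% margin over .500."""
    rpg = r_per_g + ra_per_g
    x = rpg ** 0.29
    rr = r_per_g / ra_per_g
    w_pct = rr ** x / (rr ** x + 1)
    return (r_per_g - ra_per_g) / (w_pct - 0.5)

print(round(runs_per_win(5.0, 4.0), 2))  # about 9.62
print(round(runs_per_win(5.5, 4.0), 2))  # about 10.11
```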

So the team that scores .5 more runs per game is buying its additional wins at a cost of about .49 more runs per win. This is somewhat similar to a starting player deriving value by being better than .450 and a backup deriving value by being better than .360. Diminishing returns. Now obviously, if your starter is .450, your backup must be less than that. So maybe the chained alternative should be tied to the quality of the player in the first place. Seems unfair again? Same principle. It's not something that we are used to considering in a player evaluation method, so it seems very weird, but the principle comes into play in other places (such as RPW) and we don't think of it as such because we are used to it.

Now an alternative way of addressing this is to point out the concept of different baselines for different purposes. A starting player, to keep his starting job, has a higher sustenance level than does a backup. Now since backups max out at say 200 PA, we could evaluate everyone's first 200 PA against the .360 level and their remaining PA against the .450 level. This may seem unfair, but I feel that it conforms to reality. A .400 player can help a team, but not if he gets 500 PA.

Some other objections to chaining will invariably come up. One is that not all teams have a backup to plug in at every position. Every team will have a backup catcher, and somebody who can play some infield positions and some outfield positions, but maybe not on an everyday basis. And this is true. One solution might be to study the issue and find that, say, 65% of teams have a bench player capable of playing center field. Then the baseline for center field would be based 65% on chaining and 35% on just plugging the FAT player into the lineup. Or sometimes more than one player will be hurt at once and the team will need a FAT player at one position. Another objection is that a player's position on the chain should not count against him. People will say that it is not the starter's fault that he has more room to be replaced under him. But really, it's not counting against him. This is the diminishing returns principle again. If he were not a starter, he would have less playing time and would be able to accrue less value. And if you want to give "Klesko" credit for the value his backup provides above the bum, fine. You are then crediting him against a .360 player, but only over 100 PA, rather than doing what the minimalist does, which is extend that .360 comparison over all 550 of his PA. That is simply, IMO, not a realistic assessment of the scenario. All of these things just demonstrate that the baseline will not in the end be based solely on chaining; it would incorporate some of the FAT level as well.

When chaining first came up on FanHome, Tango did some studies of various things and determined that in fact, chaining coupled with adjusting for selective sampling could drive the baseline as high as 90%. I am not quite as gung ho, and I'm not sure that he still is, but I am certainly not convinced that he was wrong either.

Ultimately, it comes down to whether we are trying to model reality as best as possible or whether we have an idealized idea of value. It is my opinion that chaining, incorporated at least somewhat into the baseline-setting process, best models the reality of how major league teams adjust to a loss in playing time. And without loss in playing time (actually, variance in playing time), everyone would have equal opportunity and we wouldn't be having this darn discussion. Now I will be the first to admit that I do not have a firm handle on all the considerations and complexities that would go into designing a total evaluation around chaining. There are a lot of studies we would need to do to determine certain things. But I do feel that it must be incorporated into any effort to settle the baseline question for general player value.

Plus-.500 Baselines

If a .500 advocate can claim that the goal of baseball is to win games and that sub-.500 players contribute less to winning than do their opponents, couldn't someone argue that the real goal is to make the playoffs, and that requires, say, a .560 W%, so shouldn't players be evaluated against .560?

I suppose you could make that argument. But to me at least, if a player does more to help his team win than his opponent does to help his team win, he should be given a positive rating. My opinion, however, will not do much to convince people of this.

A better argument is that the idea of winning pennants or making the playoffs is a separate question from just winning games. Let's take a player who performs at .100 in one season and at .900 in another. The player will rate, by the .560 standard, as a negative. He has hurt his team in its quest to win pennants.

But winning a pennant is a seasonal activity. In the season in which our first player performed at .900, one of the very best seasons in the history of the game, he probably added 12 wins above average to his team. That would take an 81 win team up to 93 and put them right in the pennant hunt. He has had an ENORMOUS individual impact on his team's playoff hopes, similar to what Barry Bonds has done in recent years for the Giants.

So his team wins the pennant in the .900 season, and he hurts their chances in the second season. But is there a penalty in baseball for not winning the pennant? No, there is not. Finishing 1 game out of the wildcard chase is no better, from the playoff perspective, than finishing 30 games out. So if in his .100 season he drags an 81 win team down to 69 wins, so what? They probably weren't going to make the playoffs anyway.

As Bill James said in the Politics of Glory, "A pennant is a real thing, an object in itself; if you win it, it's forever." The .100 performance does not in any way detract from the pennant that the player provided by playing .900 in a different season.

And so pennant value is a totally different animal. To properly evaluate pennant value, an approach such as the one proposed by Michael Wolverton in the 2002 Baseball Prospectus is necessary. Using a baseline in the traditional sense simply will not work.

Negative Value/Replacement Paradox

This is a common area of misunderstanding. If we use the FAT baseline and a player rates negatively, we can safely assume that he really does have negative value. Not negative ABSOLUTE value--nobody can have negative absolute value. But he does have negative value to a major league team, because EVERYBODY, from the minimalists to the progressivists to the averagists to the chainists, would agree that the team could find, for nothing, a better player.

But if we use a different baseline (average in particular is used this way), a negative runs or wins above baseline figure does not mean that the player has negative value. It simply means that he has less value than the baseline he is being compared to. It does not mean that he should not be employed by a major league team.

People will say something like, ".500 proponents would have us believe that if all of the sub-.500 players in baseball retired today, there would be no change in the quality of play tomorrow." Absolute hogwash! An average baseline does not in any way mean that its proponents feel that a .490 player has no value, or that there is an infinite supply of .500 players as there is of .350 players. It simply means that they choose to compare players to their opponents. It is a relative scale.

Even Bill James does not understand this, or pretends not to understand it in order to promote his own method and discredit TPR (which uses a .500 baseline). For instance, in Win Shares, he writes: "Total Baseball tells us that Billy Herman was three times the player that Buddy Myer was." No, that's not what it's telling you. It's telling you that Herman had three times more value above his actual .500 opponent than Myer did. He writes, "In a plus/minus system, below average players have no value." No, it tells you that below average players are less valuable than their opponents, and that if you had a whole team of them you would lose more than you would win.

These same arguments could be turned against a .350 based system too. You could say that I rate at 0 WAR, since I never played in the majors, and that the system is saying that I am more valuable than Alvaro Espinoza. It's the exact same argument, and it's just as wrong going the other way as it is going this way.

And this naturally leads into something called the "replacement paradox." The replacement paradox is essentially that, using a .500 baseline, a .510 player with 10 PA will rate higher than a .499 player with 500 PA. And that is true. But the same is just as true at lower baselines. Advocates of the minimal baseline will often use the replacement paradox to attack a higher baseline. But the sword can be turned against them. They will say that nobody really cares about the relative ratings of .345 and .355 players. But hasn't a .345 player with 500 PA shown himself to have more ability than a .355 player with 10 PA? Yes, he has. Of course, on the other hand, he has also provided more evidence that he is a below average player. That kind of naturally leads into the idea of using the baseline to estimate a player's true ability. Some have suggested a close-to-.500 baseline for this purpose. Of course, the replacement paradox holds wherever you go from 0 to 1 on the scale. I digress; back to the replacement paradox as it pertains to the minimal level. While we may not care that much about how .345 players rate against .355 players, it is also true that we're not as sure exactly where that line really is as we are with the .500 line. How confident are we that it is .350 and not .330 or .370? And that uncertainty can wreak havoc with the ratings of players who, for all we know, could be above replacement level.

And now back to player ability; really, ability implies a rate stat. If there is an ability to stay healthy (some injuries undoubtedly occur primarily because of luck--what could Geoff Jenkins have done differently to avoid destroying his ankle, for instance?), and there almost certainly is, then that is separate from a player's ability to perform when he is on the field. And performance when you are on the field is a rate, almost by definition. Now a player who has performed at a .600 level for 50 PA is certainly not equal to someone with a .600 performance over 5000 PA. What we need is some kind of confidence interval. Maybe we are 95% confident that the first player's true ability lies between .400 and .800, and we are 95% confident that the second player's true ability lies between .575 and .625. Which is a safer bet? The second. Which is more likely to be a bum? The first, by a huge margin. Which is more likely to be Babe Ruth? The first as well. Anyway, ability is a rate and does not need a baseline.
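
The intervals in that last paragraph are illustrative, but as a rough sketch of the idea, here is a simple binomial-style interval on an individual W%-scale rate, which narrows as PA grow. Treating PA as independent win/loss trials is an oversimplification, so the exact widths shouldn't be taken literally; the point is the shrinking uncertainty.

```python
import math

def rough_w_pct_interval(w_pct, pa, z=1.96):
    """Rough 95% interval for a 'true' W%-scale rate, treating PA as independent trials."""
    se = math.sqrt(w_pct * (1 - w_pct) / pa)
    return w_pct - z * se, w_pct + z * se

print([round(x, 3) for x in rough_w_pct_interval(0.600, 50)])    # roughly .46 to .74
print([round(x, 3) for x in rough_w_pct_interval(0.600, 5000)])  # roughly .59 to .61
```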

Win Shares and the Baseline

A whole new can of worms has been opened up by Bill James' Win Shares. Win Shares is very deceptive, because the hook is that a win share is 1/3 of an absolute win. Except it's not. It can't be.

Win Shares are divided between offense and defense based on marginal runs, which for the purpose of this debate are runs above 50% of the league average. So the percentage of team marginal runs that comes from offense is the percentage of team win shares that comes from offense. Then each hitter gets "claim points" for his RC above what a 50% player would have created.

Anyway, if we figure these claim points for every player, some may come in below the 50% level. What does Bill James do? He zeroes them out. He is saying that the minimum level of value is 50%, but that you cannot have negative value against this standard, no matter how bad you are. Then, if you total up the positive claim points for the players on the team, the percentage of those belonging to our player is the percentage of the offensive win shares he will get. The fundamental problem here is that he is divvying up absolute value based on marginal runs. The distortion may not be great, but it is there. Really, as David Smyth has pointed out, absolute wins are the product of absolute runs scored, and absolute losses are the product of absolute runs allowed. In other words, hitters don't make absolute losses, and pitchers don't make absolute wins. The proper measure for a hitter is therefore the number of absolute wins he contributes compared to some baseline. And this is exactly what every RAR or RAA formula does.
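
Here is a deliberately stripped-down sketch of the offensive split described in this paragraph, ignoring the many adjustments in the full Win Shares method: team marginal runs set the offense/defense split, hitters get claim points for RC above a 50%-of-league hitter, negatives are zeroed out, and each hitter's share of the positive claim points is his share of the offensive win shares.

```python
def offensive_win_shares(team_win_shares, team_runs, team_runs_allowed,
                         lg_runs, lg_r_per_out, hitters):
    """Toy version of the offensive side of Win Shares as described in the text.
    team_win_shares is 3 * team wins; hitters is a list of (name, rc, outs).
    The real method contains many adjustments that are ignored here."""
    off_marginal = max(team_runs - 0.5 * lg_runs, 0.0)
    def_marginal = max(1.5 * lg_runs - team_runs_allowed, 0.0)
    off_pool = team_win_shares * off_marginal / (off_marginal + def_marginal)

    # Claim points: RC above what a 50%-of-league hitter would create in the same outs,
    # with negative claims zeroed out -- the step the text objects to
    claims = {name: max(rc - 0.5 * lg_r_per_out * outs, 0.0) for name, rc, outs in hitters}
    total_claims = sum(claims.values())
    return {name: off_pool * c / total_claims for name, c in claims.items()}
```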

Now, to properly use Win Shares to measure value, you need to compare it to some baseline. Win Shares are incomplete without Loss Shares, or opportunity. Due to the convoluted nature of the method, I have no idea what the opportunity is. Win Shares are not really absolute wins. I think the best explanation is that they are wins above approximately .200, masquerading as absolute wins. They're a mess.

My Opinions

The heading here is a misnomer, because my opinions are littered all throughout this article. Anyway, I think that the use of "replacement level" to mean the FAT, or RAT, or .350 level has become so pervasive in the sabermetric community that some people have decided they don't need to even defend it anymore, and can simply shoot down other baselines because they do not match. This being an issue of semantics and theory, and something you can't prove the way you can test a Runs Created method, you should always be open to new ideas and be able to make a rational defense of your position. Note that I am not accusing all of the proponents of the .350 replacement rate of being closed-minded; there are, however, some who don't wish to even consider other baselines.

A huge problem with comparing to the bottom barrel baseline is intelligibility. Let's just grant for the sake of discussion that comparing to .350 or .400 is the correct, proper way to go. Even so, what do I do if I have a player who is rated as +2 WAR? Is this a guy who I want to sign to a long term contract? Is he a guy who I should attempt to improve upon through trade, or is he a guy to build my team around?

Let's say he's a full time player with +2 WAR, making 400 outs. At 25 outs/game, he has 16 individual offensive games(following the IMO faulty OW% methodology as discussed at the beginning), and replacement level is .350, so his personal W% is .475. Everyone, even someone who wants to set the baseline at .500, would agree that this guy is a worthwhile player who can help a team, even in a starting job. He may contribute less to our team than his opponent does to his, but players who perform better than him are not that easy to find and would be costly to acquire.

So he's fine. But what if my whole lineup and starting rotation were made up of +2 WAR players? I would have a below average team at .475. So while every player in my lineup would have a positive rating, I wouldn't win anything with these guys. If I want to know whether my team will be better than my opponent's, I'm going to need to figure out how many WAR an average player would have in 400 outs. He'd have +2.4, so I'm short by .4. Now I have two baselines. Except what is my new baseline? It's simply WAA. So to be able to know how I stack up against the competition, I need the average baseline as well.
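
The arithmetic in this example, in a few lines (the 16 "individual games" come from 400 outs at 25 outs per game):

```python
OUTS_PER_GAME = 25

def personal_w_pct(war, outs, baseline=0.350):
    """Back out the individual W% implied by a wins-above-replacement figure."""
    games = outs / OUTS_PER_GAME
    return baseline + war / games

def wins_above_average(war, outs, baseline=0.350):
    """Convert wins above the replacement baseline into wins above average."""
    games = outs / OUTS_PER_GAME
    return war - (0.500 - baseline) * games

print(round(personal_w_pct(2.0, 400), 3))      # .475
print(round(wins_above_average(2.0, 400), 3))  # -0.4
```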

That is not meant to be a refutation of value against the minimum; as I said, that theory of value could be completely true and the above would still hold. But to interpret the results, I'm going to need to introduce, whether intentionally or not, a higher baseline. Thus, I am making the case here that even if the minimum baseline properly describes value, other baselines have relevance and are useful as well.

As for me, I'm not convinced that it does properly define value. Why must I compare everybody to the worst player on the team? The worst player on the team is only there to suck up 100 PA or 50 mop up IP. A starting player who performs like the worst player is killing my team. And chaining shows that the true effect on my team if I was to lose my starter is not to simply insert this bum in his place in many cases. So why is my starter's value based on how much better he is than that guy?

Isn't value supposed to measure the player's actual effect on the team? So if the effect of my team losing a starter is that a .450 composite would take his place, why must I be compelled to compare him to .350? It is ironic that as sabermetricians move more and more towards literal value methods like Win Expectancy Added and Value Added Runs, or pseudo-literal value methods like Win Shares, which stress measuring what actually happened and crediting it to the player whether or not we can prove he has an ability to repeat the performance, they insist on baselining a player's value against what a hypothetical bottom-of-the-barrel player would do, and not against the baseline implied by the dynamics of a baseball team.

I did a little study of the 1990-1993 AL, classifying 9 starters and 4 backups for each team by their primary defensive position. Anyone who was not a starter or backup was classified as a FAT player. Let me point out now that there are numerous potential problems with this study, coming from selective sampling, and the fact that if a player gets hurt, his backup may wind up classified as the starter and the true starter as the backup, etc. There are many such issues with this study but I will use it as a basis for discussion and not as proof of anything.

The first interesting thing is to look at the total offensive performance for each group. Starters performed at 106.9 in terms of adjusted RG, or about a .530 player. Backups came in at 87.0, or about .430. And the leftovers, the "FAT" players, were at 73.8, or about .350.

So you can see where the .350 comes from. The players with the least playing time have an aggregate performance level of about .350. If you combine the totals for bench players and FAT players, properly weighted, you get 81.7, or about .400. And there you can see where someone like Woolner got 80%--the aggregate performance of non-starters.

Now, let's look at this in a chaining sense. There are 4 bench players per team, and let's assume that each of them can play one position on an everyday basis. We'll ignore DH, since anyone can be a DH. So 50% of the positions have a bench player, who performs at the 87% level, who can replace the starter; the other half must be replaced by a 73.8% player. The average there is 80.4%, or about .390. So even using these somewhat conservative assumptions, and even if the true FAT level is .350, the comparison point for a starter is about .390, in terms of the player who can actually replace him.
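
For reference, the W% equivalents quoted in the last few paragraphs all come from squaring the percentage of league (i.e., a Pythagorean exponent of 2):

```python
def pct_to_w(pct_of_league):
    """Convert a percentage of league run production to a W% with a Pythagorean exponent of 2."""
    return pct_of_league ** 2 / (pct_of_league ** 2 + 1)

for p in (1.069, 0.870, 0.738, 0.817, 0.804):
    print(f"{p:.1%} of league -> {pct_to_w(p):.3f}")
# 106.9% -> .533, 87.0% -> .431, 73.8% -> .353, 81.7% -> .400, 80.4% -> .393
```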

Just to address the problems in this study again: one is that if the starter goes down, the backup becomes the starter and the FAT guy becomes the backup, or the backup goes down, and so on; the study doesn't adjust for the changes that teams are forced to make, and that skews the results. Another is that if a player does not perform, he may be discarded, so there is selective sampling going on. A player might have the ability to perform at a .500 level, but if he plays at a .300 level for 75 PA, they're going to ship him out of there. This could especially affect the FAT players; they could easily be just as talented as the bench players, but they hit poorly, and that's why they didn't get enough PA to qualify as bench players.

The point was not to draw conclusions from the study. Anyway, why can't we rate players against three standards? We could compare (using the study numbers even though we know they are not truly correct) to the .530 level for the player's value as a starter, to .430 for the player's value as a bench player, and to .350 for a player's value to play in the major leagues at any time. Call this the "multi-tiered" approach. And that's another point: the .350 players don't really have jobs! They get jobs in emergencies, and don't get many PA anyway. The real level of who can keep a job, on an ideal 25-man roster, is that bench player. Now if the average bench player is .430, maybe the minimum level for a bench player is .400.

Anyway, what is wrong with having three different measures for three different areas of player worth? People want one standard on which to evaluate players, for the "general" question of value. Well, why can't we measure the first 50 PA against FAT, the next 150 against the bench level, and the rest against the starter level? And you could say, "Well, if a .400 player is a starter, that's not his fault; he shouldn't have PAs 200-550 measured against a starter." Maybe, maybe not. If you want to know what this player has actually done for his team, you want to count the negative value. But if you're not after literal value, you could just zero out negative performances.

A player who gets 150 PA and plays at a .475 level has in a way helped his team, relative to his opponent. Because while the opponent as a whole is .500, the comparable piece on the other team is .430. So he has value to his team versus his opponent; the opponent doesn't have a backup capable of a .475 performance. But if the .475 player gets 600 PA, he is hurting you relative to your opponent.

And finally, let's tie my chaining argument back in with the progressive argument. Why should a player's career value be compared to a player who is barely good enough to play in the majors for two months? If a player is around for ten years, he had better, at some point during his career, at least perform at the level of an average bench player.

Now the funny thing is that, as I am about to end this article, I am not going to tell you what baseline I would want to use if I were forced to choose just one. I know for sure it wouldn't be .350. It would be something between .400 and .500: .400 on the conservative side, evaluating the raw data without taking its biases into account; .500 if I say "screw it all, I want literal value." But I think I will use .350, and publish RAR, because even though I don't think it is right, it seems to be the consensus choice among sabermetricians, many of whom are smarter than me. Is that a sell out? No. I'll gladly argue with them about it any time they want. But since I don't really know what it is, and I don't really know how to go about studying it to find it and control all the biases and pitfalls, I choose to err on the side of caution. EXTREME caution (got to get one last dig in at the .350 level).

I have added a spreadsheet that allows you to experiment with the "multi-tiered" approach. Remember, I am not endorsing it, but I am presenting it as an option and one that I personally think has some merit.

Let me first explain the multi-tiered approach as I have sketched it out in the spreadsheet. There are two "PA thresholds." The first PA threshold is the "FAT" threshold; anyone with fewer PA than this is considered a FAT player. The second is the backup threshold; anyone with more PA than that is a starter. Anyone with a number of PA between the two thresholds is a backup.

So, these are represented as PA Level 1 and PA Level 2 in the spreadsheet, and are set at 50 and 200. Play around with those if you like.

Next to the PA Levels, there are percentages for FAT, backup, and starter. These are the levels at which value is considered to start for a player in each class, expressed as a percentage of league r/o. Experiment with these too; I have set the FAT at .73, the backup at .87, and the starter at 1.

Below there, you can enter the PA, O, and RC for various real or imagined players. "N" is the league average runs/game. The peach colored cells are the ones that do the calculations, so don't edit them.

RG is simply the player's runs per game figure. V1 is the player's value compared to the FAT baseline. V2 is the player's value compared to the backup baseline. V3 is the player's value compared to the starter baseline. v1 is the player's value against the FAT baseline, with a twist: only 50 PA count (or whatever you have set the first PA threshold to be). Say you have a player with 600 PA. The multi-tiered theory holds that he has value in being an above-FAT player, but only until he reaches the first PA threshold. Past that, you should not have to play a FAT player, and he no longer accrues value against that baseline. If the player has fewer than 50 PA, his actual PA are used.

v2 does the same for the backup level. Only the number of PA between the two thresholds counts, so with the defaults, there are a maximum of 150 PA evaluated against this level. If the player has fewer PA than the first threshold, he doesn't get evaluated here at all; he gets a 0.

v3 applies the same concept to the starter level. The number of PA that the player has over the second threshold is evaluated against the starter level. If he has fewer PA than the second threshold, he gets a 0.

SUM is the sum of v1, v2, and v3. This is one of the possible end results of the multi-tiered approach. comp is the baseline that the SUM is comparing against. A player's value in runs above baseline can be written as:

(RG - x*N)*O/25 = L

Where RG is the player's runs per game, x is the baseline (as a percentage of league), N is the league average runs per game, O is the player's outs, 25 is the o/g default, and L is the value. We can solve for x even if the equation we use to find L does not explicitly use a baseline (as is the case here):

x = (RG-25*L/O)/N

So the comp is the effective baseline used by the multi-tiered approach. As you will see, this will vary radically, which is the point of the multi-tiered approach.

The +only SUM column is the sum of v1, v2, and v3, but only counting positive values. If a player has a negative value for all 3, his +only SUM is 0. If a player has a v1 of +3, a v2 of +1, and a v3 of -10, his +only SUM is +4, while his SUM would have been -6. The +only SUM does not penalize the player if he is used to an extent at which he no longer has value. It is another possible final value figure of the multi-tiered approach. The next column, comp, does the same thing as the other comp column, except this time it is based on the +only SUM rather than the SUM.
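
For readers without the spreadsheet, here is a sketch of the multi-tiered calculation as described above. Exactly how the spreadsheet prorates outs across the PA tiers isn't spelled out, so this version assumes outs are split in proportion to PA; that assumption reproduces the 83% and 89% effective baselines used in the tax-bracket illustration below.

```python
def multi_tier_value(pa, outs, rc, lg_n,
                     pa_levels=(50, 200), pcts=(0.73, 0.87, 1.00), outs_per_game=25):
    """Sketch of the multi-tiered approach. lg_n is the league average runs per game ("N").
    Assumes each tier's outs are the player's outs prorated by the PA falling in that tier."""
    rg = rc / outs * outs_per_game
    t1, t2 = pa_levels
    tier_pa = (min(pa, t1), max(min(pa, t2) - t1, 0), max(pa - t2, 0))
    tier_values = []
    for share, pct in zip(tier_pa, pcts):
        tier_outs = outs * share / pa
        tier_values.append((rg - pct * lg_n) * tier_outs / outs_per_game)
    v1, v2, v3 = tier_values
    total = v1 + v2 + v3
    plus_only = sum(v for v in tier_values if v > 0)
    comp = (rg - outs_per_game * total / outs) / lg_n  # effective baseline, x = (RG - 25*L/O)/N
    return {"v1": v1, "v2": v2, "v3": v3, "SUM": total, "+only SUM": plus_only, "comp": comp}

# A 120%-of-league hitter with 190 PA in an assumed 4.5 R/G league: his effective
# baseline ("comp") works out to about 83% of league, as in the tax-bracket example below.
example = multi_tier_value(pa=190, outs=130, rc=5.4 * 130 / 25, lg_n=4.5)
print(round(example["comp"], 3))  # about 0.833
```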

So play around with this spreadsheet. See what the multi-tiered approach yields and whether you think it is worth the time of day.

One of the objections that would be raised to the multi-tiered approach is that there is a different baseline for each player. I think this is its strength, of course. Think of it like progressive tax brackets. It is an uncouth analogy for me to use, but it helps explain the idea, and we're not confiscating property from these baseball players. So let's just say that the lowest bracket starts at 25K at 10%, and then at 35K the rate jumps to 15%. So would you rather make 34K or 36K?

It's an absurd question, isn't it? At 34K, you will pay $900 in taxes, while at 36K you will pay $1150, but your net income is $33,100 against $34,850. Of course you want to make 36K, even if that bumps you into the next bracket.

The same goes for the multi-tiered value. Sure, a player who performs at 120% of the league average in 190 PA is having his performance "taxed" at 83%, and one who has 300 PA is being "taxed" at 89%. But you'd still rather have the guy with 300 PA, and you'd still rather make 36 grand.

Now, just a brief word about the replacement level(s) used in the season statistics on this website. I have used the .350 standard, although I personally think it is far too low. The reason I have done this is that I don't really have a good answer for what it should be (although I would think something in the .8-.9 region, as a percentage of league, would be a reasonable compromise). Anyway, since I don't have a firm choice myself, I have decided to publish results based on the baseline chosen by a plurality of the sabermetric community. This way they are useful to the largest number of people for what they want to look at. Besides, you can go into the spreadsheet and change the replacement levels to whatever the heck you want.

I am now using 73% (.350) for hitters, 125% (.390) for starting pitchers, and 111% (.450) for relievers; the pitcher percentages are in terms of runs allowed relative to the league. There is still a lot of room for discussion and debate, and I'm hardly confident that those are the best values to use.

Replacement Level Fielding

There are some systems, most notably Clay Davenport’s WARP, that include comparisons to replacement level fielders. I believe that these systems are incorrect, as are those that consider replacement level hitters; however, the distortions involved in the fielding case are much greater.

The premise of this belief is that players are chosen for their combined package of offense and defense, which shouldn't be controversial. Teams also recognize that hitting is more important, even for a shortstop, than fielding. Even a brilliant defensive shortstop like Mario Mendoza or John McDonald doesn't get a lot of playing time when he creates runs at 60% of the league average. And guys who hit worse than that just don't wind up in the major leagues.

It also turns out at the major league level that hitting and fielding skill have a pretty small correlation, for those who play the same position. Obviously, in the population at large, people who are athletically gifted at hitting a baseball are going to carry over that talent to being gifted at fielding them. When you get to the major league level, though, you are dealing with elite athletes. My model (hardly groundbreaking) of the major league selection process is that you absolutely have to be able to hit at say 60% of the league average (excluding pitchers of course). If you can’t reach this level, no amount of fielding prowess is going to make up for your bat (see Mendoza example). Once you get over that hitting threshold, you are assigned to a position based on your fielding ability. If you can’t field, you play first base. Then, within that position, there is another hitting threshold that you have to meet (let’s just say it’s 88% of average for a first baseman).

The bottom line is that there is no such thing as a replacement level fielder or a replacement level hitter. There is only a replacement level player. A replacement level player might be a decent hitter (90 ARG) with limited defensive ability (think the 2006 version of Travis Lee), who is only able to fill in sometimes at first base, or he might be a dreadful hitter (65 ARG) who can catch, and is someone's third catcher (any number of nondescript reserve catchers floating around baseball).

Thus, to compare a player to either a replacement level fielder or a replacement level hitter is flawed; that's not how baseball teams pick players. Your total package has to be good enough; if you are a "replacement level fielder" who can't hit the ball out of the infield, you probably never even get a professional contract. If you are a "replacement level hitter" who fields like Dick Stuart, well, you'd be a heck of a softball player.

However, if you do compare to a “replacement level hitter” at a given position, you can get away with it. Why? Because, as we agreed above, all players are chosen primarily for their hitting ability. It is the dominant skill, and by further narrowing things down by looking at only those who play the same position, you can end up with a pretty decent model. Ideally, one would be able to assign each player a total value (hitting and fielding) versus league average, but the nature of defense (how do you compare a first baseman to a center fielder defensively?) makes it harder. Not impossible, just harder, and since you can get away fairly well with doing it the other way, a lot of people (myself included) choose to do so.

Of course, there are others that just ignore it. I saw a NL MVP analysis for 2007 just yesterday on a well-respected analytical blog (I am not going to name it because I don’t want to pick on them) that simply gave each player a hitting Runs Above Average figure and added it to a fielding RAA figure which was relative to an average fielder at the position. The result is that Hanley Ramirez got -15 runs for being a bad shortstop, while Prince Fielder got -5 runs for being a bad first baseman. Who believes that Prince Fielder is a more valuable defensive asset to a team than Hanley Ramirez? Anyone?

Comparing to a replacement level fielder as Davenport does is bad too, but it is often not obvious to people why. I hope that my logic above has convinced you that it is a bad idea; now let's talk about the consequences of it. Davenport essentially says that a replacement level hitter is a .350 OW%, or 73 ARG, hitter. This is uncontroversial and may be the most standard replacement level figure in use. But most people agreed upon this figure under the premise that it is an average defender who hits at that level. Davenport's system gives positive value above replacement to anyone who can hit at this level, even if he is a first baseman. Then, comparing to a replacement level fielder serves as the position adjustment. Shortstops have a lower replacement level than first basemen (or, since the formula is not actually published, it seems as though this is the case), and so even Hanley picks up more FRAR than Prince. However, the overall replacement level is now much lower than .350.

So Davenport’s WARP cannot be directly compared to the RAR/WAR figures I publish, or even BP’s own VORP. If one wants to use a lower replacement level, they are free to do so, but since the point Davenport uses is so far out of line with the rest of the sabermetric community, it seems like some explanation and defense of his position would be helpful. Also, even if one accepts the need for a lower replacement level, it is not at all clear that the practice of splitting it into hitting and fielding is the best way to implement it.
