Tuesday, June 16, 2020

Preoccupied With 1985: Linear Weights and the Historical Abstract

I stumbled across this unpublished post while cleaning up some files – it was not particularly timely when written about ten years ago, and is even less timely now. Unlike some other old pieces I find, though, I don’t know why I never published it, other than maybe redundancy and beating a dead horse. I still agree with the opinions I expressed, and it is well above the low bar required for inclusion on this blog.

The original edition of Bill James’ Historical Baseball Abstract, published in 1985, is my favorite baseball book, and I am far from the only well-read baseball aficionado who holds it in such high regard. It contains a very engaging walk through each decade in major league history, some interesting material on rating players (including what has to be one of the first explicit discussions of peak versus career value in those terms), ratings of the best players by position and the top 100 players overall, and career statistics for about 200 all-time greats, which seem like nothing in the internet age but at the time represented the most comprehensive collation of data on those players.

However, there is one section of the book which does not hold up well at all. It really didn’t hold up at the time, but I wasn’t in a position to judge that. James reviews The Hidden Game of Baseball, published the previous year by John Thorn and Pete Palmer, and gives his thoughts about the Linear Weights system.

James’ lifelong aversion to linear weights is somewhat legendary among those of us who delve deeply into these issues, but the discussion in the Historical Abstract is the source of the river, at least in terms of James’ published material. For years, James’ thoughts colored the perception of linear weights by many consumers of sabermetric research. This is no longer the case, as many people interested in sabermetrics twenty-five years later have never read the original book, and linear weights have been rehabilitated and widely accepted through the work of Mitchel Lichtman, Tom Tango, and now many others.

So to go back thirty years later and rake James’ essay over the coals is admittedly unfair. You may choose to look at this as gratuitous James-bashing if you please; that is not my intent, but I won’t protest any further than this paragraph. I think that some of the arguments James advances against linear weights are still heard today in different words, and occasionally you will still see a reference to the article from an old Runs Created diehard. And if one can address the concerns of the Bill James of 1985 on linear weights, it should go a long way in addressing the concerns of other critics.

It should be noted that James on the whole is quite complimentary of The Hidden Game and its authors. I will be focusing on his critical comments on methodology, and so any excerpts I use will be of the argumentative variety and, if taken without this disclaimer, could give the wrong impression of James’ view of the work as a whole.

The first substantive argument that James offers against Palmer’s linear weights (in this case, really, the discussion is focused on the Batting Runs component) is their accuracy. The formula in question is:

BR = .46S + .80D + 1.02T + 1.40HR + .33(W + HB) + .3SB - .6CS - .25(AB - H) - .5(OOB)
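For reference, the formula can be expressed as a function (a sketch using only the coefficients as printed; the abbreviations are Palmer's, with OOB being outs on base):

```python
def batting_runs(S, D, T, HR, W, HB, SB, CS, AB, H, OOB, out_weight=-0.25):
    # Palmer's Batting Runs, in runs above average; the out weight is nominally
    # -.25 but is recalculated for each league-season (see below)
    return (0.46 * S + 0.80 * D + 1.02 * T + 1.40 * HR
            + 0.33 * (W + HB) + 0.3 * SB - 0.6 * CS
            + out_weight * (AB - H) - 0.5 * OOB)
```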

As you know, Palmer’s formula uses an out value that returns an estimate of runs above average rather than absolute runs scored (in which case it would be somewhere around -.1). The formula listed by Palmer fixes the out value at -.25, but it is explained that the actual value is to be calculated for each league-season. James notes this, but then ignores it in using the Batting Runs formula to estimate team runs scored. To do so, he simply adds the above result to the league average of runs scored per team for the season. He opines that the resulting estimates “[do] not, in fact, meet any reasonable standard of accuracy as a predictor of runs scored.”

And it’s true--they don’t. This is not because the BR formula does not work, but rather because James applied it incorrectly. As he explains, “For the sake of clarity, the formula as it appears above yields the number of runs that the teams should be above or below the league average; when you add in the league average, as I did here, you should get the number of runs that they score.”

This seems reasonable enough, but in fact it is an incorrect application of the formula. The correct way to use a linear weights above average formula to estimate total runs scored is to add the result to the league average runs/out multiplied by the number of outs the team actually made.

This can be demonstrated pretty simply by using the same league-seasons (1983, both leagues) that James uses in the initial test in the Historical Abstract. If you use the BR formula with -.25 as the out weight and simply add the result to the league average runs scored (in each respective league), the RMSE is 29.5. Refine that a little bit by instead adding the result to the number of outs each team made multiplied by the respective league runs/out (but still using -.25 as the out weight), and the RMSE improves to 29.3. The James formula that uses the most comparable input, stolen base RC, has an RMSE of 24.4, and you can see why (in this limited sample; I’m certainly not advocating paying much heed to accuracy tests based on one year of data, and neither was James) he thought BR was less accurate. But had he applied the formula properly, by figuring custom out values for each league (-.255 in the AL and -.244 in the NL) and adding the resulting RAA estimate to league runs/out times team outs, he would have gotten an RMSE of 18.7.
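A sketch of the two applications (the team line and league figures below are placeholders rather than the actual 1983 data, but the structure of the calculation is the same):

```python
import math

def rmse(estimates, actuals):
    # root mean square error of the run estimates against actual runs scored
    return math.sqrt(sum((e - a) ** 2 for e, a in zip(estimates, actuals)) / len(actuals))

def runs_james(br, lg_runs_per_team):
    # James' application: Batting Runs plus the league average runs per team
    return br + lg_runs_per_team

def runs_correct(br, team_outs, lg_runs_per_out):
    # proper application: Batting Runs plus league runs/out times the team's actual outs
    return br + lg_runs_per_out * team_outs

# placeholder team: +25 Batting Runs and 4,100 outs, in a league averaging
# 710 runs per team and .172 runs per out
print(runs_james(25, 710))                      # 735
print(round(runs_correct(25, 4100, 0.172), 1))  # 730.2
```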

In fairness to James, the authors of The Hidden Game did not do a great job in explaining the intricacies of linear weight calculations. The book is largely non-technical, and nitty-gritty details are glossed over. The proper method to compute total runs scored from the RAA estimate is never exactly explained, nor is the precise way to calculate the out value specific to a league-season (while it’s a matter of simple algebra, presenting the formula explicitly would have cleared up some confusion). To do a fair accuracy test versus a method like Runs Created, which does not take into account any data on league averages, you would also need to calculate the -.1 out value over a large sample and hold it constant, which Thorn and Palmer did not do or explain. In addition, the accuracy test was not as well-designed as it could have been, although that wouldn’t have had much of an impact on the results for Batting Runs or Runs Created; it mattered more for rate stats converted to runs.

James then goes on to explain the advantage that Batting Runs has in terms of being able to home in on the correct value for runs scored, since it is defined to be correct on the league level. He is absolutely correct (as discussed in the preceding paragraph) that this is an unfair advantage to bestow in a run estimator accuracy test; however, it is also demonstrable that even under a fair test, Batting Runs and other similar linear weight methods acquit themselves nicely and are more accurate than comparable contemporary versions of Runs Created.

In the course of this discussion, James writes “What I would say, of course, is that while baseball changes, it changes very slowly over a long period of time; the value of an out in the American League in 1987 will be virtually identical with the value of an out in the American League in 1988.” This turned out to be an unfortunate future example for James since the AL averaged 4.90 runs/game in 1987 but just 4.36 in 1988. James’ point has merit--values should not jump around wildly for no reason other than the need to minimize RMSE--but the Batting Runs out value does not generally behave in a manner inconsistent with simply tracking changes in league scoring.

James’ big conclusion on linear weights is: “I think that the system of evaluation by linear weights is not at all accurate to begin with, does not become any more accurate with the substitution of figures derived from one season’s worth of data…Linear weights cannot possibly evaluate offense for the simplest of reasons: Offense is not linear.”

He continues “The creation of runs is not a linear activity, in which each element of the offense has a given weight regardless of the situation, but rather a geometric activity, in which the value of each element is dependent on the other elements.” James is correct that offense is not linear and that the value of any given event is dependent on the frequency of other events. But his conclusion that linear weights are incapable of evaluating offense is only supported by his faulty interpretation of the accuracy of Batting Runs. While offense is not linear, team offense is restricted to a narrow enough range that linear methods can accurately estimate team runs scored.

More importantly, James fails to recognize that while offense is dynamic, a poor dynamic estimator (such as his own Runs Created) is not necessarily (and in fact, is not) going to perform better than a linear weight method at the task of estimating runs scored. He also does not consider the problems that might be inherent in applying a dynamic run estimator directly to an individual player’s batting line, when the player is in fact a member of a team rather than his own team. Eventually, he would come to this realization and begin using a theoretical team version of Runs Created (which is one of the many reasons this criticism of his thirty-five year old essay can be viewed as unfair).

Much of the misunderstanding probably could have been avoided had Batting Runs been presented as absolute runs rather than runs above average. Palmer has never used an absolute version in any of his books, but of course many others have used absolute linear weight methods. One of the more prominent is Paul Johnson’s Estimated Runs Produced, which was brought to the public eye when none other than Bill James published Johnson's article in the 1985 Abstract annual.

Johnson’s ERP formula was dressed up in a way that made it plain to see that it was linear, but did not explicitly show the coefficient for each event as Batting Runs did. Still, it remains almost inexplicable that an analyst of James’ caliber did not see the connection between the two approaches, as he was writing two very different opinions on the merits of each nearly simultaneously.

James also applies his broad brush to Palmer’s win estimation method, saying that if you ask the Pythagorean method “If a team scores 800 runs and allows 600, how many games will they win?”, it gives you an answer (104), while “the linear weights” says “ask me after the season is over.”

The use of the phrase “ask me after the season is over” is the kind of ill-conceived rhetoric that seems out of place in a James work but would be expected in a criticism of him by a clueless sportswriter. Any metric that compares to a baseline or includes anything other than the player’s own performance (such as a league average or a park factor) is going to see its output change as that independent input changes. That goes for many of James’ metrics as well (OW% for instance).

To the extent that the criticism has any validity, it should be used in the context of Batting Runs, since admittedly Palmer did not explain how to use linear weights to figure an absolute estimate of runs in the nature of Runs Created. To apply it to Palmer’s win estimator (RPW = 10*sqrt(runs per inning by both teams)) simply does not make sense. The win estimator does not rely on the league average; it accounts for the fact that each run is less valuable to a win as the total number of runs scored increases, but it doesn’t require the use of anything other than the actual statistics of the team and its opponents. (Of course, when applied to an individual player’s Batting Runs it does use the league average, which again is no different conceptually than many of James’ methods.) The Pythagorean formula with a fixed exponent has the benefit (compared to a linear estimator, even a dynamic one) of restricting W% to the range [0, 1], but it also treats all equal run ratios as translating to equal win ratios.
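For comparison's sake, here is a sketch of both estimators as described (the Pythagorean record with a fixed exponent of 2, and Palmer's runs-per-win rule of thumb); the innings figure is just a full season's worth:

```python
import math

def pythagorean_wins(runs, runs_allowed, games=162, exponent=2):
    # fixed-exponent Pythagorean estimate: 800 scored / 600 allowed -> ~104 wins
    rate = runs ** exponent / (runs ** exponent + runs_allowed ** exponent)
    return games * rate

def palmer_runs_per_win(runs, runs_allowed, innings=162 * 9):
    # Palmer's rule: RPW = 10 * sqrt(runs per inning by both teams)
    return 10 * math.sqrt((runs + runs_allowed) / innings)

print(round(pythagorean_wins(800, 600)))        # 104
print(round(palmer_runs_per_win(800, 600), 2))  # about 9.8 runs per win in that environment
```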

James concludes his essay by comparing the offensive production of Luke Easter in 1950 and Jimmy Wynn in 1968. His methods show Easter creating 94 runs making 402 outs and Wynn creating 91 runs making 413 outs, while Batting Runs shows Easter as +29 runs and Wynn +26.

James goes on to point out that the league Easter played in averaged 5.04 runs per game, while Wynn’s league averaged 3.43, and thus Wynn was the far superior offensive player, by a margin of +37 to +18 runs using RC. “Same problem--the linear weights method does not adapt to the needs of the analysis, and thus does not produce an accurate portrayal of the subject.”

In this case, James simply missed the disclaimer that the out weight varies with each league-season. While it makes sense to criticize the treatment of the league average as a known in testing the accuracy of a run estimator, it doesn’t make any sense at all to criticize using it when putting a batter’s season into context. Of course, James agrees that context is important, as he converts Easter and Wynn’s RC into baselined metrics in the same discussion.

When Batting Runs is allowed to calculate its out value as intended, it produces a similar verdict on the value of Easter and Wynn. In Total Baseball (using a slightly different but very much same-in-spirit Batting Runs formula), Palmer estimates Wynn at +38 and Easter at +14, essentially in agreement with James’ estimate of +37 and +18. The concept of linear weights did not fail; James’ comprehension of it did. It doesn’t matter if that happened because Palmer and Thorn’s explanation wasn’t straightforward (or comprehensive) enough, or whether James just missed the boat, or a combination of both. Whatever the reason, the essay “Finding the Hidden Game, pt. 3” is not a fair or accurate assessment of the utility of linear weight methods and stands as the only real blemish on as good a baseball book as has ever been written.

Monday, June 15, 2020

Tripod: Baselines

See the first paragraph of this post for an explanation of this series.

This essay will touch on the topics of various baselines and which are appropriate(in my opinion) for what you are trying to measure. In other words, it discusses things like replacement level. This is a topic that creates a lot of debate and acrimony among sabermetricians. A lot of this has to do with semantics, so all that follows is my opinion, some of it backed by facts and some of it just opinion.

Again, I cannot stress this enough; different baselines for different questions. When you want to know what baseline you want to use, first ask the question: what am I trying to measure?

Anyway, this discussion is kind of disjointed, so I'll just put up a heading for a topic and write on it.

Individual Winning Percentage

Usually the baseline is discussed in terms of a winning percentage. This unfortunate practice stems from Bill James' Offensive Winning Percentage. What is OW%? For instance, if Jason Giambi creates 11 runs per game in a context where the average team scores 5 runs per game, then Giambi's OW% is the W% you would expect when a team scores 11 runs and allows 5 (.829 when using a Pythagorean exponent of 2). It is important to note that OW% assumes that the team has average defense.

So people will refer to a replacement level of say .333, and what they mean is that the player's net value should be calculated as the number of runs or wins he created above what a .333 player would have done. This gets very confusing when people try to frame the discussion of what the replacement level should be in terms of actual team W%s. They'll say something like, "the bottom 1% of teams have an average W% of .300, so let's make .300 the replacement level". That's fine, but the .300 team got its record from both its offense and defense. If the team had an OW% of .300 and a corresponding DW% of .300, their record would be about .155.

Confusing, eh? And part of that comes from the silly idea of putting a player's individual performance in the form of a team's W%. So, I prefer to define replacement level in terms of percentage of the league average the player performed at. It is much easier to deal with, and it just makes more sense. But I may use both interchangeably here since most people discuss this in terms of W%.
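For the record, here is how the .829 and .155 figures above fall out of the Pythagorean math (a sketch assuming an exponent of 2 throughout):

```python
def off_w_pct(runs_per_game, lg_runs_per_game, exponent=2):
    # OW%: the W% of a team scoring the player's rate against an average defense
    ratio = runs_per_game / lg_runs_per_game
    return ratio ** exponent / (ratio ** exponent + 1)

def combine_ow_dw(ow, dw, exponent=2):
    # combine an offensive and a defensive W% by multiplying the implied run ratios
    run_ratio = (ow / (1 - ow)) ** (1 / exponent) * (dw / (1 - dw)) ** (1 / exponent)
    return run_ratio ** exponent / (run_ratio ** exponent + 1)

print(round(off_w_pct(11, 5), 3))             # the Giambi example: .829
print(round(combine_ow_dw(0.300, 0.300), 3))  # a .300 offense with a .300 defense: .155
```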

ADDED 12/04: You can safely skip this part and understand the rest of the article; it's really about a different subject anyway. I should note the weakness of the % of league approach. The impact of performing at 120% of the league is different at different levels of run scoring. The reason for this is that the % of league for a particular player is essentially a run ratio(like runs scored/runs allowed for a team). We are saying that said player creates 20% more runs than his counterpart, which we then translate into a W% by the Pythagorean by 1.2^2/(1.2^2+1)=.590. But as you can read in the "W% Estimator" article, the ideal exponent varies based on RPG. In a 10 RPG context(fairly normal), the ideal exponent is around 1.95. But in a 8 RPG context, it is around 1.83. So in the first case a 1.2 run ratio gives a .588 W%, but in the other it gives a .583. Now this is a fairly minor factor in most cases, but we want to be as precise as possible.

So from this you might determine that indeed the W% display method is ideal, but the W% approach serves to ruin the proportional relationship between various Run Ratios(with a Pyth exponent of 2, a 2 RR gives an .800 W%, while a 1 RR gives .500, but 2 is 2 times as high as 1, not .8/.5). So the ideal thing as far as I'm concerned is to use the % of league, but translate it into a win ratio by raising it to the proper pythagorean exponent for the context(which can be figured approximately as RPG^.28). But this shouldn't have too big of an impact on the replacement level front. If you like the win ratio idea but want to convert it back into a run ratio, you can pick a "standard" league that you want to translate everybody back into(ala Clay Davenport). So if you want a league with a pyth exponent of 2, take the square root of the win ratio to get the run ratio. Generally (W/L) = (R/RA)^x or (R/RA) = (W/L)^(1/x).
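A sketch of that conversion chain (using the approximate exponent given above; the specific numbers are just the 120%-of-league example):

```python
def pyth_exponent(rpg):
    # context-specific Pythagorean exponent, approximated as RPG^.28 per the text
    return rpg ** 0.28

def win_ratio(pct_of_league, rpg):
    # treat % of league as a run ratio and raise it to the exponent for that context
    return pct_of_league ** pyth_exponent(rpg)

def run_ratio_in_standard_league(wr, standard_exponent=2):
    # translate the win ratio back into the run ratio of a chosen "standard" league
    return wr ** (1 / standard_exponent)

wr = win_ratio(1.20, 10)                           # a 120%-of-league hitter, 10 RPG context
print(round(wr / (wr + 1), 3))                     # the corresponding W% (about .586 here)
print(round(run_ratio_in_standard_league(wr), 3))  # equivalent run ratio in an exponent-2 league
```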

Absolute Value

This is a good place to start. Why do we need a baseline in the first place? Why can't we just look at a player's Runs Created, and be done with it? Sabermetricians, I apologize, this will be quite patronizing for you.

Well, let's start by looking at a couple of players:

Player A: 145 H, 17 D, 1 T, 19 HR, 13 W
Player B: 128 H, 32 D, 2 T, 19 HR, 25 W

The first guy has 68 RC, the second guy has 69. But when you discover that Player A made 338 outs and Player B made 284 outs, the choice becomes pretty clear, no? BTW, player A is Randall Simon and player B is Ivan Rodriguez(2003).

But you could say that we should have known player B was better, because we could just look at his runs/out. But of course I could give you an example of 2 guys with .2 runs/out, but one made 100 outs and had 20 RC and another made 500 outs and had 100 RC. And so you see that there must be some kind of balance between the total and the rate.

The common sense way to do this is with a baseline. Some people, like a certain infamous SABR-L poster, will go to extreme lengths to attempt to combine the total and the rate in one number, using all sorts of illogical devices. A baseline is logical. It kills two or three birds with one stone. For one thing, we can incorporate both the total production and the rate of production. For another, we eventually want to evaluate the player against some sort of standard, and that standard can be the baseline that we use. And using a baseline automatically inserts an adjustment for league context.

There is value in every positive act done on a major league field. There is no way that you can provide negative absolute value. If you bat 500 times, make 499 outs and draw 1 walk, you have still contributed SOMETHING to your team. You have provided some value to that team.

But on the other hand, the team could have easily, for the minimum salary, found someone who could contribute far, far more than you could. So you have no value to the team in an economic sense. The team has no reason to pay you a cent, because they can find someone who can put up a .000/.002/.000 line panhandling on the street. This extreme example just goes to show why evaluating a major league player by the total amount of production he has put up is silly. That leads into the question: what is the level at which a team can easily find a player who can play that well?

Minimum Level

This is where a lot of analysts like to draw the baseline. They will find the level at which there are dozens of available AAA players who perform that well, and that is the line against which they evaluate players. Those players are numerous and therefore have no real value to a team. They can call up another one from AAA, or find one on waivers, or sign one out of the Atlantic League. Whatever.

There are a number of different ways of describing this, though. One is the "Freely Available Talent" level. That's sort of the economic argument I spelled out. But is it really free? This might be nitpicking, but I think it is important to remember that all teams spend a great deal of money on player development. If you give your first round pick a $2 million bonus and he turns out to be a "FAT" player, he wasn't really free. Of course, he is freely available to whoever might want to take him off your hands. But I like the analogy of say, getting together with your friends, and throwing your car keys in a box, and then picking one randomly and taking that car. If you put your Chevy Metro keys in there and draw out somebody's Ford Festiva keys, you didn't get anywhere. And while you now have the Festiva, it wasn't free. This is exactly what major league teams do when they pick each other's junk up. They have all poured money into developing the talent and have given up something to acquire it(namely their junk). None of this changes the fact that it is freely available or really provides any evidence against the FAT position at all; I just think it is important to remember that the talent may be free now, but it wasn't free before. Someone on FanHome proposed replacing FAT with Readily Available Talent or something like that, which makes some sense.

Another way people define this is the level at which a player can stay on a major league 25 man roster. There are many similar ways to describe it, and while there might be slight differences, they all are getting at the same underlying principle.

The most extensive study to establish what this line is was undertaken by Keith Woolner in the 2002 Baseball Prospectus. He determined that the minimum level was about equal to 80% of the league average, or approximately a .390 player. He, however, looked at all non-starters, producing a mishmash of bench players and true FAT players.

The basic idea behind all of these is that if a player fell off the face of the earth, his team would have to replace him, and the player who would replace him would be one of these readily available players. So it makes sense to compare the player to the player who would replace him in case of injury or other calamity.

A W% figure that is often associated with this line of reasoning is .350, although obviously there is no true answer and various other figures might give a better representation. But .350 has been established as a standard by methods like Equivalent Runs and Extrapolated Wins, and it is doubtful that it will be going anywhere any time soon.

Sustenance Level

This is kind of similar to the above. This is the idea that there is some level of minimum performance at which the team will no longer tolerate the player, and will replace him. This could be either his status on the roster or his status as a starting player (obviously, the second will produce a higher baseline in theory). You could also call this the "minimum sustainable performance" level.

Cliff Blau attempted a study to see when regular players lost their jobs based on their RG, at each position. While I have some issues with Blau's study, such as that it did not include league adjustments while covering some fairly different offensive contexts, his results are interesting nonetheless. He found no black line, no one level where teams threw in the towel. This really isn't that surprising, as there are a number of factors involved in whether or not a player keeps his job other than his offensive production (such as salary, previous production, potential, defensive contribution, nepotism, etc). But Bill James wrote in the 1985 Abstract that he expected there would be such a point. He was wrong, but we're all allowed to be sometimes.

Anyway, this idea makes sense. But a problem with it is that it is hard to pin down exactly where this line is, or for that matter, where the FAT line is. We don't have knowledge of a player's true ability, just a sample of varying size. The team might make decisions on who to replace based on a non-representative sample, or the sabermetrician might misjudge the talent of players in his study and thus misjudge the talent level. There are all sorts of selective sampling issues here. We also know that the major leagues are not composed of the 750 best players in professional baseball. Maybe Rickie Weeks could hit better right now than the Brewers' utility infielder, but they want him to play every day in the minors. The point is, it is impossible to draw a firm baseline here. All of the approaches involve guesswork, as they must.

Some people have said we should define replacement level as the W% of the worst team in the league. Others have said it should be based on the worst teams in baseball over a period of years. Or maybe we should take out all of the starting players from the league and see what the performance level of the rest of them is. Any way you do it, there's uncertainty, large potential for error, and a need to remember there's no firm line.

Average

But the uncertainty of the FAT or RAT or whatever baseline does leave people looking for something that is defined, and that is constant. And average fits that bill. The average player in the league always performs at a .500 level. The average team always has a .500 W%. So why not evaluate players based on their performance above what an average player would have done?

There are some points that can be made in favor of this approach. For one thing, the opponent that you play is on average a .500 opponent. If you are above .500, you will win more often than you lose. If you are below .500, you will lose more often than you win. The argument that a .500+ player is doing more to help his team win than his opponent is, while the .500- player is doing less to help his team win than his opponent is, makes for a very natural demarcation: win vs. loss.

Furthermore, the .500 approach is inherently built into any method of evaluating players that relies on Run Expectancy or Win Expectancy, such as empirical Linear Weights formulas. If you calculate the run value of each event as the final RE value minus the initial RE value plus runs scored on the play(which is what empirical LW methods are doing, or the value added approach as well), the average player will wind up at zero. Now the comparison to zero is not inevitable; you can fudge the formula or the results to compare to a non-.500 baseline, but initially the method is comparing to average.
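A minimal sketch of that calculation; the run expectancy values below are placeholders for illustration, not a real base-out table:

```python
# run expectancy by (base state, outs); placeholder values, not a real table
RE = {
    ("empty", 0): 0.50, ("first", 0): 0.88,
    ("empty", 1): 0.27, ("first", 1): 0.52,
}

def event_run_value(start, end, runs_scored):
    # the empirical linear weight of one event: RE after, minus RE before,
    # plus any runs that scored on the play
    return RE[end] - RE[start] + runs_scored

# a leadoff walk: bases empty, 0 outs -> runner on first, 0 outs, no runs scored
print(round(event_run_value(("empty", 0), ("first", 0), 0), 2))   # +0.38 with these values
```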

An argument that I have made on behalf of the average baseline is that, when looking back in hindsight on the season, the only thing that ultimately matters is whether or not the player helped you to win more than your opponent. An opponent of the average baseline might look to a .510 player with 50 PA and say that he is less valuable than a .490 player with 500 PA, since the team still had 450 additional PA with the first player. This is related to the "replacement paradox" which I will discuss later, but ignoring that issue for now, my argument back would be that it is really irrelevant, because the 450 PA were filled by someone, and there's no use crying over spilled milk. The .490 player still did less to help his team win than his opponent did to help his team win. It seems as if the minimum level is more of a forward looking thing, saying "If a team could choose between two players with these profiles, they would take the second one", which is surely true. But the fact remains that the first player contributed to wins more than his opponent. From a value perspective, I don't necessarily have to care about what might have happened, I can just focus on what did happen. It is similar to the debate about whether to use clutch hitting stats, or actual pitcher $H data, even when we know that these traits are not strongly repetitive from season to season. Many people, arguing for a literal value approach, will say that we should use actual hits allowed or a player's actual value added runs, but will insist on comparing the player to his hypothetical replacement. This is not a cut and dry issue, but it reminds us of why it is so important to clearly define what we are trying to measure and let the definition lead us to the methodology.

Average is also a comfortable baseline for some people to use because it is a very natural one. Everybody knows what an average is, and it is easy to determine what an average player's Batting Average or walk rate should be. Using a non-.500 baseline, some of this inherent sense is lost and it is not so easy to determine how a .350 player for instance should perform.

Finally, the most readily accessible player evaluation method, at least until recently, was Pete Palmer's Linear Weights system. In the catch-all stat of the system, Total Player Rating, he used an average baseline. I have heard some people say that, in the later editions of Total Baseball, he justified it on the grounds that if you didn't play .500, you couldn't make the playoffs. However, in the final published edition, on page 540, he lays out a case for average. I will quote it extensively here since not many people have access to the book:

The translation from the various performance statistics into the wins or losses of TPR is accomplished by comparing each player to an average player at his position for that season in that league. While the use of the average player as the baseline in computing TPR may not seem intuitive to everyone, it is the best way to tell who is helping his team win games and who is costing his team wins. If a player is no better than his average counterparts on other teams, he is by definition not conferring any advantage on his team. Thus, while he may help his team win some individual games during the season--just as he will also help lose some individual games--over the course of a season or of a career, he isn't helping as much as his opponents are. Ultimately, a team full of worse-than-average players will lose more games than it wins.

The reason for using average performance as the standard is that it gives a truer picture of whether a player is helping or hurting his team. After all, almost every regular player is better than his replacement, and the members of the pool of replacement players available to a team are generally a lot worse than average regulars, for obvious reasons.

If Barry Bonds or Pedro Martinez is out of the lineup, the Giants or the Red Sox clearly don't have their equal waiting to substitute. The same is typically true for lesser mortals: when an average ballplayer cannot play, his team is not likely to have an average big-league regular sitting on the bench, ready to take his place.

Choosing replacement-level performance as the baseline for measuring TPR would not be unreasonable, but it wouldn't give a clear picture of how the contributions of each player translate into wins or losses. Compared to replacement-level performance, all regulars would look like winners. Similarly, when compared to a group of their peers, many reserve players would have positive values, even though they would still be losing games for their teams. Only the worst reserves would have negative values if replacement level were chosen as the baseline.

The crux of the problem is that a team composed of replacement-level players (which would by definition be neither plus nor minus in the aggregate if replacement level is the baseline) would lose the great majority of its games! A team of players who were somewhat better than replacement level--but still worse than their corresponding average regulars--would lose more games than it won, even though the player values (compared to a replacement-level baseline) would all be positive.

Median

This is sort of related to the average school of thought. But these people will say that since the talent distribution in baseball is something like the far right hand portion of a bell curve, there are more below average players than above average players, but the superior performance of the above average players skews the mean. The average player may perform at .500, but if you were given the opportunity to take the #15 or #16 first baseman in baseball, they would actually be slightly below .500. So they would suggest that you cannot fault a player for being below average if he is in the top half of players in the game.

It makes some sense, but for one thing, the median in Major League baseball is really not that dissimilar to the mean. A small study I did suggested that the median player performs at about 96% of the league mean in terms of run creation (approx. .480 in W% terms). It is almost a negligible difference. Maybe it is farther from the mean than that (as other studies have suggested), but either way, it just does not seem to me to be a worthwhile distinction, and most sabermetricians are sympathetic to the minimum baseline anyway, so few of them would be interested in a median baseline that really is not much different from the mean.

Progressive Minimum

The progressive minimum school of thought was first expressed by Rob Wood, while trying to reconcile the average position and the minimum position, and was later suggested independently by Tango Tiger and Nate Silver as well. This camp holds that while a team may have to scramble to find a .350 replacement when a player is injured, that does not bind them to using the .350 replacement forever. A true minimal level supporter wants us to compare Pete Rose, over his whole 20+ year career, to the player that would have replaced him had he been hurt at some point during that career. But if Pete Rose had been abducted by aliens in 1965, would the Reds have still been forced to have a .350 player in 1977? No. A team would either make a trade or free agent signing to improve, or the .350 player would become better and save his job, or a prospect would eventually come along to replace him.

Now the minimum level backer might object, saying that if you have to use resources to acquire a replacement, you are sacrificing potential improvement in other areas. This may be true to some extent, but every team at some point must sacrifice resources to improve themselves. It is not as if you can run a franchise solely on other people's trash. A team that tried to do this would eventually have no fans and would probably be repossessed by MLB. Even the Expos do not do this; they put money into their farm system, and it produced players like Guerrero and Vidro. They produced DeShields who they turned into Pedro Martinez. Every team has to make some moves to improve, so advocates of the progressive or time dependent baseline will say that it is silly to value a player based on an unrealistic representation of the way teams actually operate.

So how do we know how fast a team will improve from the original .350 replacement? Rob Wood and Tango looked at it from the perspective of expansion teams. Expansion teams start pretty much with freely available talent, but on average, they reach .500 in 8 years. So Tango developed a model to estimate the W% of such a team in year 1, 2, 3, etc. A player who plays for one year would be compared to .350, but his second year might be compared to .365, etc. The theory goes that the longer a player is around, the more chances his team has had to replace him with a better player. Eventually, a team will come up with a .500 player. After all, the average team, expending an average amount of resources, puts out a .500 team.
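To make the idea concrete, here is a toy version of a rising baseline; the straight-line climb from .350 to .500 over eight years is my simplification for illustration, not the actual model that either Wood or Tango fit:

```python
def progressive_baseline(year, start=0.350, target=0.500, years_to_average=8):
    # a hypothetical replacement level that starts at .350 in a player's first year
    # and climbs toward .500 as his team gets more chances to replace him
    # (linear only for simplicity)
    climb = (target - start) * (year - 1) / years_to_average
    return min(target, start + climb)

for year in range(1, 11):
    print(year, round(progressive_baseline(year), 3))
```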

Another area you could go to from here is whether or not the baseline should ever rise above .500. This is something that I personally am very uneasy with, since I feel that any player who contributes more to winning than his opponent does should be given a positive number. But you could make the case that if a player plays for 15 years in the show, at some point he should have provided above average performance. This approach would lead to a curve for a player's career that would rise from .350, up over .500 maybe to say .550 at its peak, and then tailing back down to .350. Certainly an intriguing concept.

Silver went at it differently, by looking at players' offensive performance charted against their career PA. It made a logarithmic curve and he fitted a line to it. As PA increase, offensive production rapidly increases, but then the curve flattens out. Comparing Silver's work to Tango's work, the baselines at various years were similar. It was encouraging to see similar results coming from two totally different and independent approaches.

A common argument against the progressive baseline is that even if you can eventually develop a .500 replacement, the presence of your current player does not inhibit the development of the replacement, so if your player does not get hurt or disappear, you could peddle the replacement to shore up another area, or use him as a backup, or something else. This is a good argument, but my counter might be that it is not just at one position where you will eventually develop average players; it is all over the diamond. The entire team is trending toward the mean(.500) at any given time, be it from .600 or from .320. Another potential counter to that argument is that some players can be acquired as free agent signings. Of course, these use up resources as well, just not human resources.

The best argument that I have seen against the progressive level is that if a team had a new .540 first baseman every year for 20 years, each would be evaluated against a .350 first baseman. But if a team had the same .540 first baseman for 20 years, he would be evaluated against a .350, then a .365, then a .385, etc, and would be rated as having less value than the total of the other team's 20 players, even though each team got the exact same amount of production out of their first base position. However, this just shows that the progressive approach might not make sense from a team perspective, but does make sense from the perspective of an individual player's career. Depending on what we want to measure, we can use different baselines.

Chaining

This is the faction that I am most at home in, possibly because I published this idea on FanHome. I borrowed the term "chaining" from Brock Hanke. Writing on the topic of replacement level in the 1998 BBBA, he said something to the effect that when you lose your first baseman, you don't just lose him. You lose your best pinch hitter, who now has to man first base, and then he is replaced by some bum.

But this got me to thinking: if the team is replacing the first baseman with its top pinch hitter, who must be a better than minimum player or else he could easily be replaced, why must we compare the first baseman to the .350 player who now pinch hits? The pinch hitter might get 100 PA, but the first baseman gets 500 PA. So the actual effect on the team when the first baseman is lost is not that it gives 500 PA to a .350 player; no, instead it gives 500 PA to the .430 pinch hitter and 100 PA to the .350 player. And all of that dynamic is directly attributable to the first baseman himself. The actual baseline in that case should be something like .415.

The fundamental argument to back this up is that the player should be evaluated against the full scenario that would occur if he has to be replaced, not just the guy who takes his roster spot. Let's run through an example of chaining, with some numbers. Let's say that we have our starting first baseman, whom we'll call Ryan Klesko. We'll say Klesko has 550 PA, making 330 outs, and creates 110 runs. His backup racks up 100 PA, makes 65 outs, and creates 11 runs. Then we have an AAA player who will post a .310 OBA and create .135 runs/out, all in a league where the average is .18 runs/out. That makes Klesko a .775 player, his backup a .470 player, and the AAA guy a .360 player (ignoring defensive value and the fact that these guys are first basemen for the sake of example; we're also ignoring the effect of the individuals' OBAs on their PA in the calculations below; the effect might be slight but it is real and would serve to decrease the performance of the non-Klesko team). Now in this league, the FAT level is .135 R/O. So a minimalist would say that Klesko's value is (110/330-.135)*330 = +65.5 RAR. Or, alternatively, if the AAA player had taken Klesko's 550 PA (and it is the same thing as doing the RAR calculation), he would have 380 outs and 51 runs created.

Anyway, when Klesko and his backup are healthy, the team's first basemen have 650 PA, 395 outs, and 121 RC. But what happens if Klesko misses the season? His 550 PA will not go directly to the bum. The backup will assume Klesko's role and the bum will assume his. So the backup will now make 550/100*65=358 outs and create 11/65*358=61 runs. The bum will now bat 100 times, make 69 outs, and create .135*69=9 runs. So the team total now has 427 outs and 70 RC from its first basemen. We lose 51 runs and gain 32 outs. But in the first scenario, with the bum replacing Klesko directly (which is what a calculation against the FAT line implicitly assumes), the team total would be 445 outs and 62 runs created. So the chaining subtracts 18 outs and adds 8 runs. Klesko's real replacement is the 70/427 scenario. That is .164 runs/out, or 91% of the league average, or a .450 player. That is Klesko's true replacement. A .450 player. A big difference from the .360 player the minimalists would assume.
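The whole example can be reproduced in a few lines (a sketch; the players and league context are the hypothetical ones above):

```python
LG_R_PER_OUT = 0.18

def w_pct_vs_league(r_per_out, exponent=2):
    # runs/out relative to the league, converted to a W% with a Pythagorean exponent of 2
    ratio = r_per_out / LG_R_PER_OUT
    return ratio ** exponent / (ratio ** exponent + 1)

backup_pa, backup_outs, backup_rc = 100, 65, 11
bum_oba, bum_r_per_out = 0.310, 0.135

# direct replacement (what a .135 baseline implicitly assumes): the bum takes
# Klesko's 550 PA and the backup's line is unchanged
bum_outs_550 = round(550 * (1 - bum_oba))                        # 380
direct_outs = bum_outs_550 + backup_outs                         # 445
direct_rc = bum_r_per_out * bum_outs_550 + backup_rc             # ~62

# chained replacement: the backup slides into the 550 PA, the bum takes the 100 PA
backup_outs_550 = round(550 / backup_pa * backup_outs)           # 358
bum_outs_100 = round(100 * (1 - bum_oba))                        # 69
chained_outs = backup_outs_550 + bum_outs_100                    # 427
chained_rc = backup_rc / backup_outs * backup_outs_550 + bum_r_per_out * bum_outs_100  # ~70

print(direct_outs, round(direct_rc))      # 445 outs, ~62 RC with direct replacement
print(chained_outs, round(chained_rc))    # 427 outs, ~70 RC with chaining
rate = chained_rc / chained_outs
print(round(rate, 3), round(rate / LG_R_PER_OUT, 2), round(w_pct_vs_league(rate), 3))
# ~.164 runs/out, ~91% of the league, roughly a .450 player
```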

But what happens if the backup goes down? Well, he is just replaced directly by the bum, and so his true replacement level is a .360 player. Now people will say that it is unfair for Klesko to be compared to .450 and his backup to be compared to .360. But from the value perspective, that is just the way it is. The replacement for a starting player is simply a higher level than the replacement for a backup. This seems unfair, and it is a legitimate objection to chaining. But I suggest that it isn't that outlandish. For one thing, it seems to be the law of diminishing returns. Take the example of a team's run to win converter. The RPW suggested by the Pythagorean is:

RPW = RD:G/(RR^x/(RR^x+1)-.5)

Where RD:G is run differential per game, RR is run ratio, and x is the exponent. We know that the exponent is approximately equal to RPG^.29. So a team that scores 5 runs per game and allows 4 runs per game has an RPW of 9.62. But what about a team that scores 5.5 and allows 4? Their RPW is 10.11.
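Those two figures come straight out of the formula above; a sketch:

```python
def runs_per_win(r_per_game, ra_per_game):
    # RPW implied by the Pythagorean formula with exponent RPG^.29
    rpg = r_per_game + ra_per_game
    x = rpg ** 0.29
    rr = r_per_game / ra_per_game
    w_pct = rr ** x / (rr ** x + 1)
    return (r_per_game - ra_per_game) / (w_pct - 0.5)

print(round(runs_per_win(5.0, 4.0), 2))   # 9.62
print(round(runs_per_win(5.5, 4.0), 2))   # 10.11
```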

So a team that scores .5 runs more than another is buying each of its wins at the cost of an additional .49 runs. This is somewhat similar to a starting player deriving value by being better than .450, and a backup deriving value by being better than .360. Diminishing returns. Now obviously, if your starter is .450, your backup must be less than that. So maybe the chained alternative should be tied to quality in the first place. Seems unfair again? Same principle. It's not something that we are used to considering in a player evaluation method, so it seems very weird, but the principle comes into play in other places (such as RPW) and we don't think of it as such because we are used to it.

Now an alternative way of addressing this is to point out the concept of different baselines for different purposes. A starting player, to keep his starting job, has a higher sustenance level than does a backup. Now since backups max out at say 200 PA, we could evaluate everyone's first 200 PA against the .360 level and their remaining PA against the .450 level. This may seem unfair, but I feel that it conforms to reality. A .400 player can help a team, but not if he gets 500 PA.

Some other objections to chaining will invariably come up. One objection is that not all teams have a backup to plug in at every position. Every team will invariably have a backup catcher, and somebody who can play some infield positions and some outfield positions, but maybe not on an everyday basis. And this is true. One solution might be to study the issue and find that say 65% of teams have a bench player capable of playing center field. So then the baseline for center field would be based 65% on chaining and 35% on just plugging the FAT player into the line. Or sometimes, more than one player will be hurt at once and the team will need a FAT player at one position. Another objection is that a player's position on the chain should not count against him. They will say that it is not the starter's fault that he has more room to be replaced under him. But really, it's not counting against him. This is the diminishing returns principle again. If he was not a starter, he would have less playing time, and would be able to accrue less value. And if you want to give "Klesko" credit for the value that his backup provides above the bum, fine. You are giving him credit against a .360 player, but only over 100 PA, rather than extending him that credit over all 550 of his PA as the minimalist would. That is simply IMO not a realistic assessment of the scenario. All of these things just demonstrate that the baseline will not in the end be based solely on chaining; it would incorporate some of the FAT level as well.

When chaining first came up on FanHome, Tango did some studies of various things and determined that in fact, chaining coupled with adjusting for selective sampling could drive the baseline as high as 90%. I am not quite as gung ho, and I'm not sure that he still is, but I am certainly not convinced that he was wrong either.

Ultimately, it comes down to whether or not we are trying to model reality as best as possible or if we have an idealized idea of value. It is my opinion that chaining, incorporated at least somewhat in the baseline setting process, best models the reality of how major league teams adjust to loss in playing time. And without loss in playing time(actually variance in playing time), everyone would have equal opportunity and we wouldn't be having this darn discussion. Now I will be the first to admit that I do not have a firm handle on all the considerations and complexities that would go into designing a total evaluation around chaining. There are a lot of studies we would need to do to determine certain things. But I do feel that it must be incorporated into any effort to settle the baseline question for general player value.

Plus-.500 Baselines

If a .500 advocate can claim that the goal of baseball is to win games and sub-.500 players contribute less to winning than do their opponents, couldn't someone argue that the real goal is to make the playoffs, and that requires say a .560 W%, so shouldn't players be evaluated against .560?

I suppose you could make that argument. But to me at least, if a player does more to help his team win than his opponent does to help his team win, he should be given a positive rating. My opinion, however, will not do much to convince people of this.

A better argument is that the idea of winning pennants or making the playoffs is a separate question than just winning. Let's take a player who performs at .100 one season and at .900 in the other. The player will rate, by the .560 standard, as a negative. He has hurt his team in its quest to win pennants.

But winning a pennant is a seasonal activity. In the season in which our first player performed at .900, one of the very best seasons in the history of the game, he probably added 12 wins above average to his team. That would take an 81 win team up to 93 and put them right in the pennant hunt. He has had an ENORMOUS individual impact on his team's playoff hopes, similar to what Barry Bonds has done in recent years for the Giants.

So his team wins the pennant in the .900 season, and he hurts their chances in the second season. But is there a penalty in baseball for not winning the pennant? No, there is not. Finishing 1 game out of the wildcard chase is no better, from the playoff perspective, than finishing 30 games out. So if in his .100 season he drags an 81 win team down to 69 wins, so what? They probably weren't going to make the playoffs anyway.

As Bill James said in the Politics of Glory, "A pennant is a real thing, an object in itself; if you win it, it's forever." The .100 performance does not in any way detract from the pennant that the player provided by playing .900 in a different season.

And so pennant value is a totally different animal. To properly evaluate pennant value, an approach such as the one proposed by Michael Wolverton in the 2002 Baseball Prospectus is necessary. Using a baseline in the traditional sense simply will not work.

Negative Value/Replacement Paradox

This is a common area of misunderstanding. If we use the FAT baseline, and a player rates negatively, we can safely assume that he really does have negative value. Not negative ABSOLUTE value--nobody can have negative absolute value. But he does have negative value to a major league team, because EVERYBODY from the minimalist to the progressivists to the averagists to the chainists would agree that they could find, for nothing, a better player.

But if we use a different baseline (average in particular is used this way), a negative runs or wins above baseline figure does not mean that the player has negative value. It simply means that he has less value than the baseline he is being compared to. It does not mean that he should not be employed by a major league team.

People will say something like, ".500 proponents would have us believe that if all of the sub-.500 players in baseball retired today, there would be no change in the quality of play tomorrow". Absolute hogwash! An average baseline does not in any way mean that its proponents feel that a .490 player has no value or that there is an infinite supply of .500 players as there are of .350 players. It simply means that they choose to compare players to their opponent. It is a relative scale.

Even Bill James does not understand this or pretends not to understand this to promote his own method and discredit TPR (which uses a .500 baseline). For instance, in Win Shares, he writes: "Total Baseball tells us that Billy Herman was three times the player that Buddy Myer was." No, that's not what it's telling you. It's telling you that Herman had three times as much value above his actual .500 opponent as Myer did. He writes "In a plus/minus system, below average players have no value." No, it tells you that below average players are less valuable than their opponent, and if you had a whole team of them you would lose more than you would win.

These same arguments could be turned against a .350 based system too. You could say that I rate at 0 WAR, since I never played in the majors, and that the system is saying that I am more valuable than Alvaro Espinoza. It's the exact same argument, and it's just as wrong going the other way as it is going this way.

And this naturally leads into something called the "replacement paradox". The replacement paradox is essentially that, using a .500 baseline, a .510 player with 10 PA will rate higher than a .499 player with 500 PA. And that is true. But the same is just as true at lower baselines. Advocates of the minimal baseline will often use the replacement paradox to attack a higher baseline. But the sword can be turned against them. They will say that nobody really cares about the relative ratings of .345 and .355 players. But hasn't a .345 player with 500 PA shown himself to have more ability than a .355 player with 10 PA? Yes, he has. Of course, on the other hand, he has also provided more evidence that he is a below average player as well. That kind of naturally leads into the idea of using the baseline to estimate a player's true ability. Some have suggested a close to .500 baseline for this purpose. Of course, the replacement paradox holds wherever you go from 0 to 1 on the scale. I digress; back to the replacement paradox as it pertains to the minimal level. While we may not care that much about how .345 players rate against .355 players, it is also true that we're not as sure exactly where that line really is as we are with the .500 line. How confident are we that it is .350 and not .330 or .370? And that uncertainty can wreak havoc with the ratings of players who for all we know could be above replacement level.

And now back to player ability; really, ability implies a rate stat. If there is an ability to stay healthy (some injuries undoubtedly occur primarily because of luck--what could Geoff Jenkins have done differently to avoid destroying his ankle, for instance), and there almost certainly is, then that is separate from a player's ability to perform when he is on the field. And performance when you are on the field is a rate, almost by definition. Now a player who has performed at a .600 level for 50 PA is certainly not equal to someone with a .600 performance over 5000 PA. What we need is some kind of confidence interval. Maybe we are 95% confident that the first player's true ability lies between .400 and .800, and we are 95% confident that the second player's true ability lies between .575 and .625. Which is a safer bet? The second. Which is more likely to be a bum? The first, by a huge margin. Which is more likely to be Babe Ruth? The first as well. Anyway, ability is a rate and does not need a baseline.

Win Shares and the Baseline

A whole new can of worms has been opened up by Bill James' Win Shares. Win Shares is very deceptive, because the hook is that a win share is 1/3 of an absolute win. Except it's not. It can't be.

Win Shares are divided between offense and defense based on marginal runs, which for the purpose of this debate, are runs above 50% of the league average. So the percentage of team marginal runs that come from offense is the percentage of team win shares that come from offense. Then each hitter gets "claim points" for their RC above a 50% player.

Anyway, if we figure these claim points for every player, some may come in below the 50% level. What does Bill James do? He zeroes them out. He is saying that the minimum level of value is 50%, but you do not have negative value against this standard, no matter how bad you are. Then, if you total up the positive claim points for the players on the team, the percentage of those belonging to our player is the percentage of offensive win shares he will get. The fundamental problem here is that he is divvying up absolute value based on marginal runs. The distortion may not be great, but it is there. Really, as David Smyth has pointed out, absolute wins are the product of absolute runs, and absolute losses are the product of absolute runs allowed. In other words, hitters don't make absolute losses, and pitchers don't make absolute wins. The proper measure for a hitter is therefore the number of absolute wins he contributes compared to some baseline. And this is exactly what every RAR or RAA formula does.
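A stripped-down sketch of the offensive claim-point step just described; the real Win Shares method layers many more adjustments on top of this, and the hitters and league figures here are made up:

```python
def offensive_win_shares(hitters, lg_r_per_out, team_offensive_ws):
    # claim points: runs created above what a 50%-of-league hitter would create
    # in the same number of outs, with negative claims zeroed out as James does
    claims = {name: max(0.0, rc - 0.5 * lg_r_per_out * outs)
              for name, (rc, outs) in hitters.items()}
    total = sum(claims.values())
    # each hitter's share of the team's offensive win shares is his share of the
    # positive claim points
    return {name: round(team_offensive_ws * c / total, 1) for name, c in claims.items()}

hitters = {"A": (110, 330), "B": (30, 300), "C": (20, 250)}   # (RC, outs), made up
print(offensive_win_shares(hitters, 0.18, 60))                # C's negative claim is zeroed out
```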

Now, to properly use Win Shares to measure value, you need to compare it to some baseline. Win Shares are incomplete without Loss Shares, or opportunity. Due to the convoluted nature of the method, I have no idea what the opportunity is. Win Shares are not really absolute wins. I think the best explanation is that they are wins above approximately .200, masquerading as absolute wins. They're a mess.

My Opinions

The heading here is a misnomer, because my opinions are littered all throughout this article. Anyway, I think that the use of "replacement level" to mean the FAT, or RAT, or .350 level, has become so pervasive in the sabermetric community that some people have decided they don't need to even defend it anymore, and can simply shoot down other baselines because they do not match. This being an issue of semantics and theory, and something you can't prove the way you can test a Runs Created method, you should always be open to new ideas and be able to make a rational defense of your position. Note that I am not accusing all of the proponents of the .350 replacement rate of being closed-minded; there are however some who don't wish to even consider other baselines.

A huge problem with comparing to the bottom barrel baseline is intelligibility. Let's just grant for the sake of discussion that comparing to .350 or .400 is the correct, proper way to go. Even so, what do I do if I have a player who is rated as +2 WAR? Is this a guy who I want to sign to a long term contract? Is he a guy who I should attempt to improve upon through trade, or is he a guy to build my team around?

Let's say he's a full-time player with +2 WAR, making 400 outs. At 25 outs/game, he has 16 individual offensive games (following the IMO faulty OW% methodology as discussed at the beginning), and replacement level is .350, so his personal W% is .475. Everyone, even someone who wants to set the baseline at .500, would agree that this guy is a worthwhile player who can help a team, even in a starting job. He may contribute less to our team than his opponent does to his, but players who perform better than him are not that easy to find and would be costly to acquire.

So he's fine. But what if my whole lineup and starting rotation was made up of +2 WAR players? I would be a below average team at .475. So while every player in my lineup would have a positive rating, I wouldn't win anything with these guys. If I want to know if my team will be better than the team of my opponent, I'm going to need to figure out how many WAR an average player would have in 400 outs. He'd have +2.4, so I'm short by .4. Now I have two baselines. Except what is my new baseline? It's simply WAA. So to be able to know how I stack up against the competition, I need the average baseline as well.
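To make the arithmetic explicit, here is a minimal Python sketch of the conversions I just walked through. The function names and the defaults (25 outs per game, a .350 baseline) are my own shorthand for the example, not anything official:

def war_to_personal_wpct(war, outs, baseline=0.350, outs_per_game=25):
    # individual offensive "games" and the personal W% implied by a WAR figure
    games = outs / outs_per_game
    return baseline + war / games

def war_at_baseline(wpct, outs, baseline, outs_per_game=25):
    # WAR a player with a given personal W% would post against any chosen baseline
    return (wpct - baseline) * outs / outs_per_game

# war_to_personal_wpct(2, 400) returns .475, the starter in the example above
# war_at_baseline(.500, 400, .350) returns 2.4, the average player's WAR in 400 outs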

That is not meant to be a refutation of value against the minimum; as I said, that theory of value could be completely true and the above would still hold. But to interpret the results, I'm going to need to introduce, whether intentionally or not, a higher baseline. Thus, I am making the case here that even if the minimum baseline properly describes value, other baselines have relevance and are useful as well.

As for me, I'm not convinced that it does properly define value. Why must I compare everybody to the worst player on the team? The worst player on the team is only there to suck up 100 PA or 50 mop up IP. A starting player who performs like the worst player is killing my team. And chaining shows that the true effect on my team if I was to lose my starter is not to simply insert this bum in his place in many cases. So why is my starter's value based on how much better he is than that guy?

Isn't value supposed to measure the player's actual effect on the team? So if the effect of my team losing a starter is that a .450 composite would take his place, why must I be compelled to compare to .350? It is ironic that as sabermetricians move more and more towards literal value methods like Win Expectancy Added and Value Added Runs, or pseudo-literal value methods like Win Shares, which stress measuring what actually happened and crediting it to the player, whether or not we can prove he has an ability to repeat the performance, they insist on baselining a player's value against what a hypothetical bottom-of-the-barrel player would do rather than against the baseline implied by the dynamics of a baseball team.

I did a little study of the 1990-1993 AL, classifying 9 starters and 4 backups for each team by their primary defensive position. Anyone who was not a starter or backup was classified as a FAT player. Let me point out now that there are numerous potential problems with this study, coming from selective sampling and from the fact that if a player gets hurt, his backup may wind up classified as the starter and the true starter as the backup, etc. So I will use it as a basis for discussion and not as proof of anything.

The first interesting thing is to look at the total offensive performance for each group. Starters performed at 106.9 in terms of adjusted RG, or about a .530 player. Backups came in at 87.0, or about .430. And the leftovers, the "FAT" players, were at 73.8, or about .350.

So you can see where the .350 comes from. The players with the least playing time have an aggregate performance level of about .350. If you combine the totals for bench players and FAT players, properly weighted, you get 81.7, or about .400. And there you can see where someone like Woolner got 80%--the aggregate performance of non-starters.

Now, let's look at this from a chaining perspective. There are 4 bench players per team, and let's assume that each of them can play one position on an everyday basis. We'll ignore DH, since anyone can be a DH. So 50% of the positions have a bench player, who performs at the 87% level, who can replace the starter, and the other half must be replaced by a 73.8% player. The average there is 80.4%, or about .390. So even using these somewhat conservative assumptions, and even if the true FAT level is .350, the comparison point for a starter is about .390, in terms of the player who can actually replace him.

Just to address the problems in this study again: one is that if the starter goes down, the backup becomes the starter and the FAT guy becomes the backup. Or the backup goes down; the study doesn't adjust for the changes that teams are forced to make, which skew the results. Another is that if a player does not perform, he may be discarded, so there is selective sampling going on. A player might have the ability to perform at a .500 level, but if he plays at a .300 level for 75 PA, they're going to ship him out of there. This could especially affect the FAT players; they could easily be just as talented as the bench players, but they hit poorly, and that's why they didn't get enough PA to qualify as bench players.

The point was not to draw conclusions from the study. Anyway, why can't we rate players against three standards? We could compare (using the study numbers even though we know they are not truly correct) to the .530 level for the player's value as a starter; to .430 for the player's value as a bench player; and to .350 for a player's value to play in the major leagues at any time. Call this the "multi-tiered" approach. And that's another point. The .350 players don't really have jobs! They get jobs in emergencies, and don't get many PA anyway. The real level at which a player can keep a job, on an ideal 25-man roster, is that of the bench player. Now if the average bench player is .430, maybe the minimum level for a bench player is .400.

Anyway, what is wrong with having three different measures for three different areas of player worth? People want one standard on which to evaluate players, for the "general" question of value. Well, why can't we measure the first 50 PA against FAT, the next 150 against the bench, and the rest against a starter? And you could say, "Well, if a .400 player is a starter, that's not his fault; he shouldn't have PA 200-550 measured against a starter." Maybe, maybe not. If you want to know what this player has actually done for his team, you want to charge him with the negative value. But if you're not after literal value, you could just zero out negative performances.

A player who gets 150 PA and plays at a .475 level has in a way helped his team, relative to his opponent. Because while the opponent as a whole is .500, the comparable piece on the other team is .430. So he has value to his team versus his opponent; the opponent doesn't have a backup capable of a .475 performance. But if the .475 player gets 600 PA, he is hurting you relative to your opponent.

And finally, let's tie my chaining argument back in with the progressive argument. Why should a player's career value be compared to a player who is barely good enough to play in the majors for two months? If a player is around for ten years, he had better, at some point during his career, at least perform at the level of an average bench player.

Now the funny thing is that as I am about to end this article, I am not going to tell you what baseline I would use if I were forced to choose one. I know for sure it wouldn't be .350. It would be something between .400 and .500: .400 on the conservative side, evaluating the raw data without taking its biases into account; .500 if I say "screw it all, I want literal value." But I think I will use .350, and publish RAR, because even though I don't think it is right, it seems to be the consensus choice among sabermetricians, many of whom are smarter than me. Is that a sell out? No. I'll gladly argue with them about it any time they want. But since I don't really know what the right baseline is, and I don't really know how to go about studying it to find it and control all the biases and pitfalls, I choose to err on the side of caution. EXTREME caution (got to get one last dig in at the .350 level).

I have added a spreadsheet that allows you to experiment with the "multi-tiered" approach. Remember, I am not endorsing it, but I am presenting it as an option and one that I personally think has some merit.

Let me first explain the multi-tiered approach as I have sketched it out in the spreadsheet. There are two "PA Thresholds". The first PA threshold is the "FAT" threshold; anyone with fewer PA than this is considered a FAT player. The second is the backup threshold; anyone with more PA than that is a starter. Anyone with a PA total between the two thresholds is a backup.

So, these are represented as PA Level 1 and PA Level 2 in the spreadsheet, and are set at 50 and 200. Play around with those if you like.

Next to the PA Levels, there are percentages for FAT, backup, and starter. These are the levels at which value is considered to start for a player in each class, expressed as a percentage of league r/o. Experiment with these too; I have set the FAT at .73, the backup at .87, and the starter at 1.

Below there, you can enter the PA, O, and RC for various real or imagined players. "N" is the league average runs/game. The peach colored cells are the ones that do the calculations, so don't edit them.

RG is simply the player's runs per game figure. V1 is the player's value compared to the FAT baseline. V2 is the player's value compared to the backup baseline. V3 is the player's value compared to the starter baseline. v1 is the player's value against the FAT baseline, with a twist; only 50 PA count (or whatever you have set the first PA threshold to be). Say you have a player with 600 PA. The multi-tiered theory holds that he has value in being an above-FAT player, but only until he reaches the first PA threshold. Past that, you should not have to play a FAT player, and he no longer accrues value against the FAT baseline. If the player has less than 50 PA, his actual PA are used.

v2 does the same for the backup. Only the number of PA between the two thresholds count, so with the default, there are a maximum of 150 PA evaluated against this level. If the player has less than the first threshold, he doesn't get evaluated here at all; he gets a 0.

v3 applies the same concept to the starter level. The number of PA that the player has over the second threshold are evaluated against the starter level. If he has less than the second threshold, he gets a 0.

SUM is the sum of v1, v2, and v3. This is one of the possible end results of the multi-tiered approach. comp is the baseline that the SUM is comparing against. A player's value in runs above baseline can be written as:

(RG - x*N)*O/25 = L

Where x is the baseline, 25 is the o/g default, and L is the value. We can solve for x even if the equation we use to find L does not explicitly use a baseline (as is the case here):

x = (RG - 25*L/O)/N

So the comp is the effective baseline used by the multi-tiered approach. As you will see, this will vary radically, which is the point of the multi-tiered approach.

The +only SUM column is the sum of v1, v2, and v3, but only counting positive values. If a player has a negative value for all 3, his +only SUM is 0. If a player has a v1 of +3, a v2 of +1, and a v3 of -10, his +only SUM is +4, while his SUM would have been -6. The +only SUM does not penalize the player if he is used to an extent at which he no longer has value. It is another possible final value figure of the multi-tiered approach. The next column, comp, does the same thing as the other comp column, except this time it is based on the +only SUM rather than the SUM.
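For those who would rather read the logic than reverse-engineer the cells, here is a minimal Python sketch of the calculations I just described. The function name and the pro-rating of outs across the PA tiers are my own choices for this illustration, not something dictated by the spreadsheet:

def multi_tier_value(pa, outs, rc, n, pa1=50, pa2=200, fat=0.73, backup=0.87, starter=1.0):
    # n is the league average runs per game; assumes pa and outs are positive
    rg = rc * 25 / outs                          # player's runs per game (RG)
    # full-playing-time values against each baseline (V1, V2, V3)
    V1 = (rg - fat * n) * outs / 25
    V2 = (rg - backup * n) * outs / 25
    V3 = (rg - starter * n) * outs / 25
    # PA falling into each tier
    pa_fat = min(pa, pa1)
    pa_bku = max(0, min(pa, pa2) - pa1)
    pa_str = max(0, pa - pa2)
    # tiered values (v1, v2, v3), pro-rating the player's outs by his PA in each tier
    v1 = (rg - fat * n) * (outs * pa_fat / pa) / 25
    v2 = (rg - backup * n) * (outs * pa_bku / pa) / 25
    v3 = (rg - starter * n) * (outs * pa_str / pa) / 25
    total = v1 + v2 + v3
    plus_only = sum(v for v in (v1, v2, v3) if v > 0)
    # effective baselines implied by each total: x = (RG - 25*L/O)/N
    comp = (rg - 25 * total / outs) / n
    comp_plus = (rg - 25 * plus_only / outs) / n
    return {"V1": V1, "V2": V2, "V3": V3, "v1": v1, "v2": v2, "v3": v3,
            "SUM": total, "+only SUM": plus_only, "comp": comp, "+only comp": comp_plus}

With the default thresholds and percentages, a player with 190 PA works out to an effective baseline (comp) of about .83, and one with 300 PA to about .89, which is where the "tax rates" quoted a little further down come from.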

So play around with this spreadsheet. See what the multi-tiered approach yields and whether you think it is worth the time of day.

One of the objections that would be raised to the multi-tiered approach is that there is a different baseline for each player. I think this is the strength of it, of course. Think of it like progressive tax brackets. It is an uncouth analogy for me to use, but it helps explain it, and we're not confiscating property from these baseball players. So, let's just say that the lowest bracket starts at 25K at 10%, and then at 35K it jumps to 15%. So would you rather make 34K or 36K?

It's an absurd question, isn't it? At 34K, you will pay $900 in taxes, while at 36K you will pay $1150, but your net income is $33,100 against $34,850. Of course you want to make 36K, even if that bumps you into the next bracket.

The same goes for the multi-tiered value. Sure, a player who performs at 120% of the league average in 190 PA is having his performance "taxed" at 83%, and one who has 300 PA is being "taxed" at 89%. But you'd still rather have the guy in 300 PA, and you'd still rather make 36 grand.

Now, just a brief word about the replacement level(s) used in the season statistics on this website. I have used the .350 standard although I personally think it is far too low. The reason I have done this is that I don't really have a good answer for what it should be (although I would think something in the .8-.9 region would be a reasonable compromise). Anyway, since I don't have a firm choice myself, I have decided to publish results based on the baseline chosen by a plurality of the sabermetric community. This way they are useful to the greatest number of people for what they want to look at. Besides, you can go into the spreadsheet and change the replacement levels to whatever the heck you want.

I am now using 73% (.350) for hitters, 125% (.390) for starters, and 111% (.450) for relievers. There is still a lot of room for discussion and debate, and I'm hardly confident that those are the best values to use.

Replacement Level Fielding

There are some systems, most notably Clay Davenport’s WARP, that include comparisons to replacement level fielders. I believe that these systems are incorrect, as are those that consider replacement level hitters; however, the distortions involved in the fielding case are much greater.

The premise of this belief is that players are chosen for their combined package of offense and defense, which shouldn’t be controversial. Teams also recognize that hitting is more important, even for a shortstop, than fielding. Even a brilliant defensive shortstop like Mario Mendoza or John McDonald doesn’t get a lot of playing time when he creates runs at 60% of the league average. And guys who hit worse than that just don’t wind up in the major leagues.

It also turns out at the major league level that hitting and fielding skill have a pretty small correlation among those who play the same position. Obviously, in the population at large, people who are athletically gifted at hitting a baseball are going to carry over that athleticism to fielding one. When you get to the major league level, though, you are dealing with elite athletes. My model (hardly groundbreaking) of the major league selection process is that you absolutely have to be able to hit at, say, 60% of the league average (excluding pitchers of course). If you can’t reach this level, no amount of fielding prowess is going to make up for your bat (see the Mendoza example). Once you get over that hitting threshold, you are assigned to a position based on your fielding ability. If you can’t field, you play first base. Then, within that position, there is another hitting threshold that you have to meet (let’s just say it’s 88% of average for a first baseman).

The bottom line is that there is no such thing as a replacement level fielder or a replacement level hitter. There is only a replacement level player. A replacement level player might be a decent hitter (90 ARG) with limited defensive ability (think the 2006 version of Travis Lee), who is only able to fill in sometimes at first base, or he might be a dreadful hitter (65 ARG) who can catch, and is someone’s third catcher (any number of nondescript reserve catchers floating around baseball).

Thus, to compare a player to either a replacement level fielder or hitter is flawed; that’s not how baseball teams pick players. Your total package has to be good enough; if you are a “replacement level fielder” who can’t hit the ball out of the infield, you probably never even get a professional contract. If you are a “replacement level hitter” who fields like Dick Stuart, well, you’d be a heck of a softball player.

However, if you do compare to a “replacement level hitter” at a given position, you can get away with it. Why? Because, as we agreed above, all players are chosen primarily for their hitting ability. It is the dominant skill, and by further narrowing things down by looking at only those who play the same position, you can end up with a pretty decent model. Ideally, one would be able to assign each player a total value (hitting and fielding) versus league average, but the nature of defense (how do you compare a first baseman to a center fielder defensively?) makes it harder. Not impossible, just harder, and since you can get away fairly well with doing it the other way, a lot of people (myself included) choose to do so.

Of course, there are others that just ignore it. I saw a NL MVP analysis for 2007 just yesterday on a well-respected analytical blog (I am not going to name it because I don’t want to pick on them) that simply gave each player a hitting Runs Above Average figure and added it to a fielding RAA figure which was relative to an average fielder at the position. The result is that Hanley Ramirez got -15 runs for being a bad shortstop, while Prince Fielder got -5 runs for being a bad first baseman. Who believes that Prince Fielder is a more valuable defensive asset to a team than Hanley Ramirez? Anyone?

Comparing to a replacement level fielder as Davenport does is bad too, but it is often not obvious to people why. I hope that my logic above has convinced you that it is a bad idea; now let’s talk about the consequences. Davenport essentially says that a replacement level hitter is a .350 OW%, or 73 ARG, hitter. This is uncontroversial and may be the most standard replacement level figure in use. But most people agreed upon this figure under the premise that it is an average defender who hits at that level. Davenport’s system gives positive value above replacement to anyone who can hit at this level, even if he is a first baseman. Then, comparing to a replacement level fielder serves as the position adjustment. Shortstops have a lower replacement level than first basemen (or, since the formula is not actually published, it seems like this is the case), and so even Hanley picks up more FRAR than Prince. However, the overall replacement level is now much lower than .350.

So Davenport’s WARP cannot be directly compared to the RAR/WAR figures I publish, or even BP’s own VORP. If one wants to use a lower replacement level, they are free to do so, but since the point Davenport uses is so far out of line with the rest of the sabermetric community, it seems like some explanation and defense of his position would be helpful. Also, even if one accepts the need for a lower replacement level, it is not at all clear that the practice of splitting it into hitting and fielding is the best way to implement it.

Monday, June 08, 2020

Tripod: W% Estimators

See the first paragraph of this post for an explanation of this series. Most of the below has been superseded by posts on this blog, particularly the discussion of Bill Kross' method.

While there are not quite as many winning percentage estimators as there are run estimators, there is no shortage of them. Maybe one of the reasons is that the primary method for determining W%, Bill James' Pythagorean method, is a fairly good method that does not have the obvious flaws of Runs Created--although it does have some. The inadequacy of Runs Created has always fueled innovation in the run estimation field.

BenV-L from the FanHome board has provided a classification system for win estimators, which is a little complex but does indeed make sense. He is a genuine math/stats guy, so I won't tread on his territory. I will propose a different classification system that approaches it from a slightly different angle.

First, we have the general area of methods that do not vary based on the run context. Under this, we have linear and non-linear methods. So, first we will look at static linear methods.

The static linear methods all are based in some way on runs minus runs allowed. Most of them take this general form:

W% = RD:G*S + .5

Where RD:G is Run Differential Per Game and S is slope. This is in the form of a basic linear regression, mx + b. Another way to write this, also very common, is:

W% = RD:G/RPW + .5

Where RPW is Runs Per Win, which of course is just the reciprocal of the slope. Of course, you could call the slope Wins Per Run, but I prefer sticking with the regression lingo.

It turns out that for average major league play, the slope is about .1, or an RPW of 10. For instance, I often use a value of .107, which is based on a regression on 1970s data (more out of habit than anything). However, using regression you can generate a formula that does not weight R and RA equally. One of these methods was published by Arnold Soolman based on 1901-1970 data: W% = (.102*R - .103*RA)/G + .505. This equation appears to be based on multiple regression. While it is not strictly necessary that R and RA be given equal weight, so that a team that scores as many runs as it allows is predicted at .500, it seems like the obvious choice to me.

Looking at Soolman's formula, a team that scores and allows 4 runs per game is predicted to play .501 baseball. This doesn't seem like a big deal, but let's consider the case of a league that has an average of 4 runs per game. The league would be predicted to play .501 baseball, which is obviously impossible. They would have to play .500 baseball. That is my logic for R=RA=.500 W%, and whether it is good enough is up to you.

We also have non-linear methods that use constants. Earnshaw Cook was the first to actually publish a W% estimator, and it is in this category:

W% = R*.484/RA

A team with equal R and RA would come out at .484; if you use .5 as the multiplier instead, such a team will at least be predicted at .500.

Another example is the work of Bill Kross:

if R < RA, W% = R/(2*RA)
if R > RA, W% = 1 - RA/(2*R)

Another method, one that Bill James speculated would work but never actually used, is "double the edge". It is as follows:

W% = (R/RA*2-1)/(R/RA*2)

The problem with many of these methods is that they obviously break down at the extremes. Using a slope of .1 with the linear method produces a W% of 1 at a RD:G of 5, but a team that scores 5.1 runs per game more than its opponent will not play 1.01 baseball. Cook's formula produces a W% over 1 with a run ratio over 2.07, and although it doesn't allow a sub-zero W%, it isn't accurate at all. The Kross formula simply does not provide a very accurate estimate, at least in comparison to other methods, although it does bound W% between 0 and 1. Double the Edge does not allow a W% above 1, but if the team's run ratio is under .5, it will produce a sub-zero winning percentage.
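Here is a quick Python sketch of these static methods, just to make the breakdowns concrete; it is my own illustration (using the commonly published form of the Kross rule), not anything from the original sources:

def linear_w(rd_per_g, slope=0.1):
    # static linear estimate: W% = RD:G*slope + .5 (not bounded by 0 and 1)
    return rd_per_g * slope + 0.5

def cook_w(r, ra):
    # Earnshaw Cook: W% = .484*R/RA (exceeds 1 once the run ratio tops about 2.07)
    return 0.484 * r / ra

def kross_w(r, ra):
    # Bill Kross: bounded by 0 and 1, but not especially accurate
    return r / (2 * ra) if r < ra else 1 - ra / (2 * r)

def dte_w(r, ra):
    # "double the edge": breaks down once the run ratio falls below .5
    rr = r / ra
    return (2 * rr - 1) / (2 * rr)

# the breakdowns noted above:
# linear_w(5.1) returns 1.01, cook_w(2.1, 1) returns about 1.02, dte_w(1, 2.5) returns -0.25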

So every method either is inaccurate or produces impossible answers. While all of these formula will work decently with normal teams in normal scoring contexts, we need methods that work outside of the normal range. There are real .700 teams, and there are teams that play in a context where the two teams average 13 runs a game. And if we want to apply these methods to individuals at all, we definitely need a more versatile method.

Enter the Pythagorean theorem. Bill James' formula, W% = R^2/(R^2 + RA^2), has a high degree of accuracy and fits the constraints of 0 and 1. These attributes and its relative simplicity have made it the standard for many years. James would later proclaim that 1.83 was a more optimal exponent. The formula by which he came to this conclusion was exponent = 2 - 1/(RPG - 3). At the normal RPG of 9, this does produce an exponent of 1.83, but the formula behaves badly elsewhere: it gives 2.33 at 0 RPG, blows up as RPG approaches 3, and never exceeds 2 at any RPG above 3, which as we shall see later makes it a woefully inadequate and illogical formula.

An off-the-wall sort of formula developed by the author is based on an article in the old STATS Pro Football Revealed work, which estimated the Z-score of a team's winning percentage and then converted it back into a W%. I applied this idea to baseball; it is automatically bounded by 0 and 1. Anyway, I estimated the Z-score as 2.324*(R-RA)/(R+RA), and then you can use the normal cumulative distribution function to convert it back into a W%.
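Here is a short Python sketch of that idea, using the standard normal CDF; the 2.324 multiplier is the one given above, and the function name is just my own label:

from math import erf, sqrt

def zscore_w(r, ra):
    # estimate a Z-score from run differential, then map it back through
    # the standard normal cumulative distribution function (bounded by 0 and 1)
    z = 2.324 * (r - ra) / (r + ra)
    return 0.5 * (1 + erf(z / sqrt(2)))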

Now we go into methods that vary somewhat based on the scoring context. This is normally done in terms of Runs Per Game, (R+RA)/G. First, I should just point out that it might be possible to modify the Z-score W% and the Double the Edge method to somehow account for changing RPG, but no one has done so and since the methods aren't optimal, it would probably be a waste of time.

The linear methods that do this simply use a formula based on RPG to estimate RPW or slope before estimating W%. These linear formulas are still subject to the same caveats as the static linear methods--they are not bounded by 0 and 1. But they do add more flexibility, especially within the normal scoring ranges. There are a number of these methods, all of which produce very similar results, as BenV-L found. The most famous of these is one developed by Pete Palmer, RPW = 10*sqrt(RPG/9). Some others include David Smyth's (R-RA)/(R+RA) + .5 = W%, which just assumes that RPW = RPG. Ben V-L published the same formula except multiplying (R-RA)/(R+RA) by .91, making RPW = 1.099*RPG. Just for example, another one is Tango Tiger's simple RPG/2 + 5. Again, the accuracy is improved more by using any reasonable modified slope than by finding the optimum one from among these choices.
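A small Python sketch of these RPW formulas, with names of my own choosing, in case you want to compare them yourself:

from math import sqrt

def w_from_rpw(rd_per_g, rpw):
    # dynamic linear estimate: W% = RD:G/RPW + .5
    return rd_per_g / rpw + 0.5

def palmer_rpw(rpg):
    return 10 * sqrt(rpg / 9)       # Pete Palmer

def smyth_rpw(rpg):
    return rpg                      # David Smyth: RPW = RPG

def benvl_rpw(rpg):
    return 1.099 * rpg              # Ben V-L's .91 multiplier, restated as RPW

def tango_rpw(rpg):
    return rpg / 2 + 5              # Tango Tiger

# at a normal 9 RPG these give 10.0, 9.0, 9.89, and 9.5 runs per win respectively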

Of course, as we said the problems inherent in linear methods are not resolved just by using a flexible slope. The Pythagorean model provides the bounds at 0 and 1, and is what we want to build upon. This will take the form of R^X/(R^X + RA^X).

There have been several published attempts to base X on RPG. One very simple one is RPG/4.6, from David Sadowski. The most famous is Clay Davenport's "Pythagenport", X = 1.5*log(RPG) + .45. Davenport used some extreme data and modeling to find his optimal exponent, which he claims is accurate for RPG ranging between 4 and 40.

What about RPG under 4, though? Enter David Smyth. The inventor of Base Runs, the "natural" RC function, came up with a brilliant discovery, revelation, or what have you that allows for the finding of a better exponent. Although it is a remarkably obvious conclusion once you have been exposed to it, no one other than Mr. Smyth thought it up.

The concept is very simple. The minimum RPG possible in a game is 1, because if neither team scores, the game keeps going. And if a team played 162 games at 1 RPG, they would win each game in which they scored a run and lose each time they allowed one. Therefore, to make W/(W+L) = R^X/(R^X + RA^X), X must be set equal to 1. This is a known point in the domain of the exponent: (1,1). Sadowski's formula would give an exponent of .22 at 1 RPG, causing a team that should go 100-62 (.617) to be predicted at .526. Davenport comes up with .45, which would project a .554 W% for the team--closer, but still incorrect, and our formula has to work at the only point that we know to be true.

So the search was on for an exponent formula that would 1) produce 1 at 1 RPG, 2) maintain accuracy for real major league teams, and 3) be accurate at high RPG. If criteria 1 and 2 were met, but 3 was not, then the Davenport method would be preferable at some times, and the new method would be preferable at others. We want a method that can give us a reasonable estimate all of the time.

It turns out that this author, while fooling around with various regression models fed by the known point and Davenport's exponent at other points, found that RPG^.29 matched Davenport's method in the range where a match was desired. Although I posted it on FanHome, nobody really noticed. A few months later, David Smyth posted RPG^.287, saying that he thought it was an exponent that would fit all of our needs. Bingo. Tango Tiger ran some tests which are linked below and found that RPG^.28 might be the best, but the Patriot/Smyth exponent is the one that, at least to this time, has been shown to produce the optimal results. Some people have taken to calling this Pythagenpat, a takeoff on Pythagenport, but it should always be remembered that Smyth recognized the usefulness of this method to a greater extent than I did and that without his (1,1) discovery, I would have never been attempting to develop an exponent.
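Here is a brief Python sketch of the exponent formulas under discussion; the function names are mine, and .29 is used as the default Pythagenpat power (with .287 and .28 as the other published candidates mentioned above):

from math import log10

def pyth_w(r, ra, x):
    # generic Pythagorean form: W% = R^x/(R^x + RA^x)
    return r ** x / (r ** x + ra ** x)

def pythagenport_x(rpg):
    # Clay Davenport: x = 1.5*log10(RPG) + .45
    return 1.5 * log10(rpg) + 0.45

def pythagenpat_x(rpg, power=0.29):
    # Patriot/Smyth: x = RPG^power
    return rpg ** power

# at 1 RPG, pythagenpat_x(1) = 1, so pyth_w(100, 62, 1) returns .617, matching the
# actual record of a 100-62 runs team; pythagenport_x(1) = .45 gives about .554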

Let's just close by illustrating the differences between the various methods for a team that is fairly extreme--they outscore their opponents by a 2:1 ratio in a 5 RPG context (3.33 r/g, 1.67 ra/g):

Model EW%

Cook .968

Kross .750

10 RPW .666

Pyth(X=2) .800

Palmer .723

Sadowski .680

Davenport .739

Patriot/Smyth .751

Although all of these methods with the glaring exception of Cook give a similar standard error when applied to normal major league teams, the differences are quite large when extreme teams are involved. And while a method like Kross might track the Pythagenpat well in this case, there are other cases where it will not. The same goes for all of the methods, although Pythagenport and Pythagenpat are basically equivalent from around 5 to 30 RPG as you can see in the chart linked on this page.

Although linear models do not have the best theoretical accuracy, there are certain situations in which they can come in handy. What I did was use the Pythagenpat method as the basis for a slope formula. We can calculate the slope that is in effect for a team at any given point based on the Pythagorean method by knowing the exponent x (which I figured by Pythagenpat), the Run Ratio, and the RPG. The formula for this, originally published by Smyth but in a different form, is:

S = (RR^x/(RR^x+1) - .5)/(RPG*(2*RR/(RR+1) - 1))

What I did was calculate the needed slope for a team with RR 1.01, 1.05, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, and 2 at each 1 RPG interval from 1-14. I then attempted to regress for a formula for slope based on RPG. I eventually decided to cut out the teams from 1-4 RPG because they simply were too different to fit into the model. But using the teams at 5-14 RPG, I came up with an equation that works fairly well in that range, S = .279 - .182*log(RPG). You can see in another set of charts linked below the needed slope at 1-14 RPG and a chart showing the actual needed slope (marked Series 3) and the predicted slope (Series 2). The fit is pretty good in that range, but caution should be used if you try to take it outside of the tested region. Applied to actual 1961-2000 teams projected to 162 games, it has an RMSE of 4.015, comparable to the most accurate methods.
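For anyone who wants to replicate the fit, here is a Python sketch of the two slope calculations in the paragraph above; I am reading "log" as log base 10, which reproduces a slope close to the familiar .1 at 9 RPG:

from math import log10

def pyth_slope(rr, rpg, x):
    # slope implied by the Pythagorean model at a given run ratio, RPG, and exponent
    # (Smyth's formula; undefined at RR = 1, which is why the fit starts at RR = 1.01)
    return (rr ** x / (rr ** x + 1) - 0.5) / (rpg * (2 * rr / (rr + 1) - 1))

def approx_slope(rpg):
    # the regression approximation, fit over 5-14 RPG; use caution outside that range
    return 0.279 - 0.182 * log10(rpg)

# approx_slope(9) is about .105, close to pyth_slope(1.1, 9, 9 ** 0.29)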

Finally, I think it would be useful to observe that in all of these methods, four basic components pop up a lot: Runs Per Game (RPG), Run Ratio (RR), Run Percentage (R%), and Run Differential Per Game (RD:G). I have provided the formulas for each of these and formulas that you can use to convert among them (a code sketch of the conversions follows the list); this is technical math crap instead of real sabermetric knowledge, but I find the conversion formulas useful:

RPG = (R+RA)/G

RR = R/RA

R% = R/(R+RA)

RD:G = (R-RA)/G

RR = R%/(1-R%)

RR = (RD:G/(2*RPG)+.5)/(.5-RD:G/(2*RPG))

R% = RR/(RR+1)

R% = (RD:G+RPG)/(2*RPG) = RD:G/(2*RPG) + .5

RD:G = RPG*(2*R%-1)

RD:G = RPG*(2*RR/(RR+1)-1)
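And the same identities as small Python helpers, with names of my own invention:

def runs_per_game(r, ra, g):     # RPG
    return (r + ra) / g

def rr_from_rpct(r_pct):         # RR from R%
    return r_pct / (1 - r_pct)

def rr_from_rd(rd_g, rpg):       # RR from RD:G and RPG
    return (rd_g / (2 * rpg) + 0.5) / (0.5 - rd_g / (2 * rpg))

def rpct_from_rr(rr):            # R% from RR
    return rr / (rr + 1)

def rpct_from_rd(rd_g, rpg):     # R% from RD:G and RPG
    return rd_g / (2 * rpg) + 0.5

def rd_from_rpct(r_pct, rpg):    # RD:G from R% and RPG
    return rpg * (2 * r_pct - 1)

def rd_from_rr(rr, rpg):         # RD:G from RR and RPG
    return rpg * (2 * rr / (rr + 1) - 1)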

ADDED 12/2004: Pythagorean RPW and Slope Based on Calculus

NOTE: Some (most? all?) of the insights and formulas in this section have previously been published and discussed in some way, shape, or form by Ben Vollmayr-Lee on Primer or Fanhome and by Ralph Caola in the November 2003, February 2004, and May 2004 issues of "By the Numbers". However, for the sake of my sanity, I am going to write this acknowledging their work only when I need it to further my discussion. But I just want it to be clear that I am not the first to publish this sort of stuff, and while I did have the idea to do some of this before I saw Caola's piece, he published and almost certainly actually did it first. No slight to him or Ben intended at all. Also, while elsewhere on this page I have used RD:G to mean run differential per game, I will now write it as simply RD, to make these long formulas less difficult to read. Now…

The derivative of Run Differential with respect to Pythagorean Winning Percentage should be the marginal number of runs per win. By marginal, I mean the number of runs necessary to add one more win. The marginal RPW given by the derivative process will not be the actual RPW at which the team purchased wins. That can be found through Pyth, in this formula from David Smyth (I have given its equivalent formula above, which is for slope):

RPW = 2*(R-RA)*(R^x + RA^x)/(R^x - RA^x)

Where everything is in terms of per game. Although I gave an equivalent to that above, I don't think I explained how to derive it, so I will now. Start with:

W% = RD/RPW + .5

Rearrange to get RPW = RD/(W% - .5)

Now substitute Pythagorean W% for W% to get:

RPW = (R-RA)/(R^x/(R^x + RA^x) - .5)

I haven't done the algebra, but trust me, that equation is equivalent to the Smyth equation above. As you can see from Smyth's version, when R = RA, the fraction and therefore RPW will be undefined. It turns out that at that point, the marginal RPW as derived below will fill in and make the function continuous.

To go about getting dW%/dRD, we can first write:

W% = RR^x/(RR^x + 1)

and the identity that:

RR = (RD/(2*RPG) + .5)/(.5 - RD/(2*RPG)).

We differentiate that with respect to RD to get:

dRR/dRD = ((.5 - RD/(2*RPG))*(1/(2*RPG)) - (RD/(2*RPG) + .5)*(-1/(2*RPG)))/(.5 - RD/(2*RPG))^2

That simplifies to:

dRR/dRD = 1/(2*RPG*(.5-RD/(2*RPG))^2)

We can then take the derivative of W% with respect to RR:

dW%/dRR = ((RR^x + 1)*(x*RR^(x-1))-RR^x*(x*RR^(x-1)))/(RR^x + 1)^2

We can then use:

dW%/dRR * dRR/dRD = dW%/dRD

Which is slope. Doing the math, we have:

S = x*RR^(x-1)/(2*RPG*(RR^x + 1)^2*(.5 - RD/(2*RPG))^2)

The reciprocal is RPW:

RPW = (2*RPG*(RR^x + 1)^2*(.5 - RD/(2*RPG))^2)/(x*RR^(x-1))

Caola worked out RPW for when x = 2 and got:

RPW = (RPG^2 + RD^2)^2/(RPG*(RPG^2 - RD^2))

Which is equivalent to mine when x = 2.
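Here is a Python sketch that evaluates both versions, if you would rather check the equivalence numerically than do the algebra; the function names are mine:

def marginal_rpw(rpg, rd, x):
    # reciprocal of the Pythagorean slope dW%/dRD derived above
    rr = (rd / (2 * rpg) + 0.5) / (0.5 - rd / (2 * rpg))
    return (2 * rpg * (rr ** x + 1) ** 2 * (0.5 - rd / (2 * rpg)) ** 2) / (x * rr ** (x - 1))

def caola_rpw(rpg, rd):
    # Ralph Caola's closed form for the x = 2 case
    return (rpg ** 2 + rd ** 2) ** 2 / (rpg * (rpg ** 2 - rd ** 2))

# marginal_rpw(10, 1, 2) and caola_rpw(10, 1) both return about 10.3,
# and marginal_rpw(rpg, 0, 2) reduces to rpg for any RPG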

This is all potentially useful, but we don't really care as much about the RPW for specific teams as we do about RPW for given levels of RPG. What we can do is run these formulas for an average team at a given RPG; one with a RR of one and a RD of zero(R = RA). If you do this with Caola's formula at exponent 2, you get:

RPW = (RPG^2 + 0^2)^2/(RPG*(RPG^2 - 0^2)) = RPG^4/RPG^3 = RPG

So at x = 2, an average team will have a RPW equal to their RPG, which is a very common sense approximation that people have used for RPW. What about exponents other then two though? Using my formula:

RPW = 2*RPG*(1^x + 1)^2*(.5 - 0/(2*RPG))^2/(x*1^(x-1)) = 2*RPG/X

Or slope = X/(2*RPG)

I believe this was published by David Glass on rsbb almost a decade ago. Anyway, if you use a Pyth exponent of 1.82, you will get a slope of 1.82/(2*RPG) = .91/RPG, which works out to the Ben V-L .91*(R-RA)/(R+RA) formula for winning percentage. Since 1.82 is a very good historical estimate for the Pythagorean exponent, it is then no surprise that the .91 multiplier gives one of the very best simple linear fits for W%.

Above, I mentioned Bill James' "Double the Edge" method. This method is not a particularly accurate one, but it has some cool properties and I think it is worth spending a little bit of time on. First, let's define Win Ratio (WR = W/L) and of course Run Ratio (RR = R/RA). DTE states that the relationship between them is:
WR = 2*RR - 1
Pythagorean states that:
WR = RR^x or in the most common form, WR = RR^2
Then, in DTE or Pyth: W% = WR/(WR + 1) and WR = W%/(1-W%).
DTE states the relationship between RR and WR in a linear regression fashion, y = mx + b, where y is WR, m is the slope, x is the RR, and b is the intercept. So one way to find m and b is to do a linear regression.

But there is another way as well, and that is to do some calculus based on Pythagorean. In calculus, you can find the tangent line to a function at a given point. This line has a slope equal to the derivative of the function at that point, and touches the graph of the function there. For a better explanation, find a calc professor or something, because that's the best I can describe it. Anyway, in the relationship between WR and RR, there is one point, a "known point", at which we know the exact relationship between them. That point is that when a team scores the same number of runs it allows (RR = 1), it will be a .500 team (WR = 1). What is the tangent line to Pythagorean at this point?

We can write the equation of the tangent line as:
y - y1 = m*(x - x1)
Where y is WR, y1 is the given WR (1), m is the slope or derivative of the Pythagorean equation at that point, x is the RR, and x1 is the given RR (1). The derivative of WR = RR^x with respect to RR is dWR/dRR = x*RR^(x-1). So at RR = 1, the derivative is simply x--the Pythagorean exponent. We'll use x = 2 for Pyth, and substitute everything in:
WR - 1 = 2*(RR - 1)
This simplifies to:
WR = 2*RR - 1
Double the Edge. So Double the Edge is actually the tangent line of the Pythagorean WR equation at the point where RR = WR = 1. The general form of the equation is:
WR - WR1 = (x*RR1^(x-1))*(RR - RR1)
This tangent approximation could be used at any RR/WR combo--except we do not intrinsically know what the WR should be for a 1.5 RR team, for instance. That's why we need Run-to-Win converters like Pyth in the first place. So we cannot use the general form, and the one that we can use for all teams is the one based on WR = RR = 1.

There is one other known point, WR = RR = 0. However, if you base a DTE equation on this, it is only close to Pyth at EXTREMELY low Run Ratios, and will give RR = 1 team WR > 1.

The basic DTE equation caps W% at 1, but any team with a RR of .5 will have a 0 W%, so it doesn't have both the upper AND lower bounds that make Pyth nearly unique among W% estimators. As it turns out (and not surprisingly, if you've ever looked at the graph of a parabola versus the graph of a straight line), DTE does not "kick in" fast enough at high RRs. This can be seen through the calculus to find the RPW based on DTE. dW%/dRR from DTE is:
dW%/dRR = (x*(x*RR+2-x)-x*(x*RR-x+1))/(x*RR+2-x)^2
dRR/dRD = 1/(2*RPG*(.5-RD/(2*RPG))^2)
So dW%/dRR*dRR/dRD = dW%/dRD = slope = 1/RPW:
dW%/dRD = x/(2*RPG*(.5-RD/(2*RPG))^2*(x*RR+2-x)^2)
If you try that for an average team (RD = 0, RR = 1):
slope = x/(2*RPG*1/4*4) = x/(2*RPG)
Which is the same result you get from Pythagorean, unsurprising since it matches the Pyth value at the known point. Anyway, though, if you put in a team with 10 RPG and 1 RD (1.222 RR), you get a Pyth RPW of 10.30 and a DTE RPW of 12.10. It is setting the "win purchase price" way too high and that's why it doesn't work well for high RR teams. But for ordinary teams, DTE is about as accurate as anything else.
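A quick Python sketch of that DTE slope, which reproduces the 12.10 figure for the 10 RPG, 1 RD team (versus the 10.30 from the Pythagorean version sketched earlier); again, the function name is mine:

def dte_rpw(rpg, rd, x=2):
    # reciprocal of the DTE slope dW%/dRD derived above
    rr = (rd / (2 * rpg) + 0.5) / (0.5 - rd / (2 * rpg))
    slope = x / (2 * rpg * (0.5 - rd / (2 * rpg)) ** 2 * (x * rr + 2 - x) ** 2)
    return 1 / slope

# dte_rpw(10, 1) returns about 12.1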

Just as 2 is not necessarily the most accurate exponent for Pyth, 2 is not necessarily the most accurate slope for DTE. You can use regression or various Pyth exponent formulas to find the best exponent/slope (they are the same, at least at the known point). If you do this:
WR = x*RR - (x - 1)
So W% = (x*RR - (x - 1))/(x*RR - (x - 1) + 1) = (x*RR-(x-1))/(x*RR + (2 - x))

Another thing...this same concept applies to other sabermetric formulas. For example, Davenport had/has two forms for EQR from RAW. One is (RAW/LgRAW)^2 = (R/PA)/(LgR/PA) and the other is 2*RAW/LgRAW - 1 = (R/PA)/(LgR/PA). These equations are related in the same way Pyth and DTE are.

DTE is not an important topic in W% estimators, but the math elements interest me, so you got a rambling essay about it.