Wednesday, July 08, 2020

April 4, 1994 pt. 2

I’ve previously written about the Indians/Mariners opening day game of April 4, 1994 that made me a baseball fan. I won’t rehash all that in detail again, but I recently was able to watch a replay of this game for the first time. I’d never actually seen any of it before, except for highlights; in real time I listened on the radio from about the seventh inning on.

For the rewatch, I kept a scoresheet, which is reproduced at the bottom of the post. A few observations:

* Chris Berman and Buck Martinez called the game on ESPN. Berman was not as terrible as I remember him being, but most of my exposure to him came later. That is not to say that he was good. Martinez is a middling announcer with a terrible voice, and was in 1994 as well. It would be a real treat to be able to watch this game paired with the local radio call of Herb Score and Tom Hamilton that I actually enjoyed, rather than the national guys.

* Randy Johnson had a no-hitter through seven, which was noteworthy for reasons beyond the obvious. As all Indians fans know, Bob Feller is the only pitcher to throw an Opening Day no-hitter, and here was a threat to no-hit the Indians on Opening Day in their first game in their new park with Feller on hand. Plus Randy Johnson, while not yet the legend that he would be, was obviously a legitimate no-hit candidate. He’d already thrown one in 1990, and 1993 had been his breakout year, finishing second in the Cy Young voting and recording his third straight season with over 10 K/9.

So not having seen the game and filling in the details in my mind given what I knew about the Big Unit later, I assumed that he had spent the first seven innings carving Cleveland up. But that was not the case at all; it more resembled what you would have expected a Greg Hibbard no-hit bid to look like. Through seven, Johnson had walked four and fanned two on 94 pitches. His twenty-one outs were distributed as:

12 on groundouts (including 2 DPs)
5 on flyouts
2 on strikeouts
1 popout
1 caught stealing

His opposite number, Dennis Martinez, was pitching a similar game from a DIPS perspective with one big exception – the two out solo shot he yielded to Eric Anthony in the third. Otherwise, through seven Martinez had struck out four, walked four, and hit Edgar Martinez in the first inning, providing an early injury scare as Mike Blowers pinch-ran. All this after Edgar had appeared in just 42 games in 1993; he’d appear in only three more games the rest of April.

* Two future stars were languishing down in the Indians lineup – Manny Ramirez batting eighth, and Jim Thome on the bench. It would be some time before Thome was trusted to start against left-handed pitchers, and so Mark Lewis was the ninth-place hitter and third baseman. Ramirez provided a Manny being Manny moment. After Candy Maldonado walked to open the eighth and Sandy Alomar singled to break up the no-no, Manny clanged a 1-0 Johnson offering off the big wall in left for a game-tying double. With Mark Lewis looking to advance the go-ahead run to third base, Ramirez strayed too far off second and was picked off on a first-pitch Dan Wilson throw back to second.

Ramirez and Thome were never in the game simultaneously; with the Indians down a run with one out in the tenth, Ramirez drew a walk and was replaced by pinch-runner Wayne Kirby. It was then that Thome batted for Lewis, which brought on lefty reliever King for Seattle. Thome pulled a double down the right field line to put runners at second and third, and Kirby would later score when Vizquel hit into a fielder’s choice. It would work out in the end too, as Kirby walked it off in the eleventh with a two-out, line drive single to left to score Eddie Murray with the winning run.

* Despite what I’m about to say below, this was a good game for star power as these two teams would emerge as top AL contenders of the latter half of the nineties: Hall of Famers Randy Johnson, Ken Griffey, Edgar Martinez, Eddie Murray, Jim Thome, future Hall of Famer Omar Vizquel, would have been Hall of Famer Manny Ramirez, should be Hall of Famer Kenny Lofton, could have been Hall of Famer Albert Belle, a former Rookie of the Year in Sandy Alomar, and other memorable names including Jay Buhner, Carlos Baerga, Tino Martinez, and Jose Mesa.

* One thing that struck me in re-watching it is what an ordinary game it was. Granted, given the circumstances (opening day and opening game of a new park) it was extremely memorable for Indians fans, but if you strip all that out and just evaluate it as a game, it wouldn’t be among the most exciting games of most major league teams’ seasons. I have personally attended at least six Indians games in the last four seasons that were more compelling, and I’ve only been to about sixty games in that time and I’m making that list off the top of my head. I had built it up in my head as a kind of epic, and in some senses it disappointed upon rewatch.

On the other hand, that disappointment reminded me of what a great game baseball is. I have now watched twenty-six seasons of major league baseball and perhaps have become jaded about just how interesting and exciting baseball inherently is. That this game wouldn’t rank in the top 10% of games I’ve attended recently speaks to what an amazing game baseball is. Since this game was sufficient to almost instantly turn me into a baseball nut, I suspect that a much less exciting contest would have done the trick. And it should have...I’m repeating myself again, and as I write this we are still two and a half weeks from even the possibility of baseball in 2020, and that too reminds me that baseball is just the best in every way.

* More generally on the franchise that I yoked myself to on April 4, 1994, I have no comment on the fact that the Indians will likely soon be changing their name. I do have two strains of thought on possible future names:

1. I think “Expos” is the logical choice, which is a snarky way of saying that my suspicion is that this name will be changing again in the relatively near future as the franchise settles into its new home in Montreal, Nashville, Portland, Las Vegas, Charlotte, etc.

2. “Spiders” is a dreadful option. First of all, as a general philosophy, I believe that baseball team names should be non-threatening. Most baseball team names are – I would contend that the only exceptions among the sixteen teams dating to 1901 or earlier are Pirates and Tigers, depending on what you think (very carefully) about Braves and Indians. Cubs are not an animal I would wish to encounter, but the name suggests cute and cuddly teddy bears rather than miniature grizzlies. Among expansion team names, the only one that I would classify as even mildly threatening is Rangers, and I would suspect the desired effect is strength and honor rather than menace.

The exception is the 1998 expansion. The Devil Rays and the Diamondbacks both sound threatening, although the former is actually generally harmless (to humans at least, and I think that’s all we should consider lest all the bird names become threatening) and was later downgraded to the double-meaning “Rays” anyway. The latter is a scary animal, but is also in my opinion a contender for best expansion team name, due to the baseball tie-in (my other contenders for best expansion team name would be Brewers (although that was recycled), Colt .45s/Astros, and Pilots/Mariners. The latter was a case in which the city had a great name and then got a similar yet superior one eight years later).

So I would contend that Spiders is contrary to the spirit of baseball nicknames. The history of the name is also quite problematic (although quite appropriate if my misgivings about the future of the franchise are founded). The original Spiders represented Cleveland from 1887-1899 (moving from the American Association to the National League in 1889), never winning a pennant. In the early 1890s they were a strong outfit, finishing second three times and even capturing a Temple Cup (which I do not in any way deem to be comparable to a regular season pennant) with names like Cy Young and Jesse Burkett, but they were soon a victim of the systemic corruption of the 1890s NL, with the Robison brothers siphoning off talent for the St. Louis now-Cardinals, in which they also had a stake. As you probably know, this culminated in the 20-134 debacle of 1899 before the team joined Baltimore, Louisville, and Washington on the chopping block, leaving Cleveland open for Ban Johnson’s play at major status for the American League two years later. I would contend that this is quite an ignominious history and nothing to be celebrated or emulated.

If Cleveland’s major league history must be the first source of inspiration, the Indians’ prior unofficial names won’t cut it: Blues is boring, Cleveland isn’t supporting a team called the Broncos, and Naps would be fine with me but doesn’t sell (and the headlines write themselves). The Players League outfit was referred to as the Infants. The Negro Leagues don’t provide much in the way of an option, as Cleveland’s proudest entry was the Buckeyes, a name of which only THE sports teams of one entity are worthy.

I do think there is one Cleveland major league name that would work – the first, the Forest City club which represented the city in the National Association during 1871-72. This team actually participated in the NA’s first league game on May 4, 1871. Maybe you’d have to rework it to Foresters (or even Sawyers), but it’s a name I could get behind.

Best non-historical choice, although semi-violating my own suggested rule about nice namesakes: Buzzards.



Wednesday, July 01, 2020

"Replacement Level" Managers

This is an old post that I never published. It's not good, as it just presents something of a freak show stat, but I was mildly interested by it when I re-read it so maybe someone out there will be as well. All of the facts/figures are through 2009 and I did not update them at all. I did note one factual error, which is also not corrected: Billy Southworth was inducted into the HOF in 2008.

I put quotes around "replacement level" in the title because this article is not really about establishing a replacement level for managers in the same sense as the phrase would imply when discussing players. It is rather about establishing a baseline for crude comparisons of managerial records, in the same vein as WAR--but without any claim that the baseline represents the point at which talent is freely available.

After all, it's folly to hold up a manager's W-L record as the sole evidence of his quality as a manager. Even the most ardent believers in the importance of managers to a team's record cannot possibly believe that they can separate the manager's contribution from all of the other noise that goes into a team's record.

If you want a crude method to compare managerial W-L records, there are few options that come to mind. Conventional approaches would include just looking at total wins, winning percentage, and games over .500, just as one might do with pitcher W-L records.

Of course, my own initial thought as a sabermetrician is to turn to a baseline that values longevity to some extent. If a manager is allowed to direct 3,942 major league games, yet has a sub-.500 record, it would be silly to assign him a negative number and move on (Gene Mauch). Managers are obviously employable even with losing records, and there are many factors well outside the manager's control that contribute to a team's record.

So my natural inclination is to look at a manager's wins above replacement, which inevitably leads to a decision about how to define managerial replacement level. There are a lot of ways to estimate replacement level for players, but one of the simplest is to look at the aggregate performance of players given very little playing time. The analogous solution would be to look at managerial records for those managers that were replacements, managing less than a full season of games.

When using this approach for players, one must be careful to consider the selective sampling issues involved--players that fail in an initial trial are less likely to receive future playing time, even though it is possible that their true talent is greater (the opposite is also true to some extent). The same is also likely true to some extent for managers--managers whose teams do not perform well in an initial interim role are not as likely to be retained. However, since my application here is just establishing a rough baseline to use for ultimately unimportant comparisons of managerial records, I am simply going to proceed as if these concerns are irrelevant.

The goal is not to devise a rating system for managers; it is to find a crude baseline to use for comparing un-contextualized managerial records. The freak show nature of the exercise is evident, and hopefully will serve to excuse my playing fast and loose with proper research procedure.

What I did was look at career records for all managers with fewer than 154 games managed from 1901-2009. (I removed from the list managers who served full-season stints in seasons of fewer than 154 games, the Cubs managers from the early 60s who were part of the College of Coaches experiment, and Stanley Robison and Ted Turner, who owned their teams and weren't real managers.) This is my group of "replacement-level" managers. There are 109 such managers, serving in a total of 135 different team-seasons. Their career totals of games managed range from one (ten managers, with either Rudy York or Eddie Yost as the biggest name) to 149 (Tom Runnells with the 1991-92 Expos).

Overall, they managed 5530 games (an average of 41 games each), going 2322-3208 for a .420 W%. So that will be my baseline for managerial records--.420.

By using .420 as a baseline, I don't mean to imply that it is a replacement-level in the traditional sense. It is quite possible that interim managers generally don't keep their jobs if they don't manage at least a .420 W%, but I don't mean to imply that replacement managers are ".420 managers".

If one were to attempt to measure a manager's replacement level in terms of actual effect on a team attributable to the skipper, my intuition is that it would be close to .500. There are simply too many possible candidates for managerial positions for me to think otherwise. Regardless, though, this "study" in no way indicates that the managers lowered .500 teams to .420.

The teams had a total aggregate record (with both the replacement and non-replacement managers) of 9334-11752, a .443 W%. This comparison does not take into account that the games managed by replacements ranged from one to over 140.

A crude way to compare team performance with and without the replacement level manager is to weight each team-season by the minimum of games managed by the replacement and other games. Using this approach, the weighted average of (W% with replacement manager - W% otherwise) is -.019.

Another crude approach is to weight by the harmonic mean of games managed by the replacement and others, rather than the minimum of the two. The weighted average difference is -.025 when using the harmonic mean. Those results should not be used to draw any conclusions, but without any regression or significance testing they imply that a replacement-level manager might lower a .500 team to .480 or .475, a difference in the range of four games a year. I am not claiming that is true, for the selective sampling reasons discussed previously among a myriad of other reasons.
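The two weighting schemes can be sketched in a few lines of Python. This is just to show the mechanics; the function name and the sample seasons below are mine, not the actual 1901-2009 data.

```python
from statistics import harmonic_mean

def weighted_diff(seasons, weight="min"):
    """Weighted average of (W% with replacement manager - W% with others).

    Each season is a tuple: (replacement wins, replacement games,
    other wins, other games). Each season is weighted by either the
    minimum or the harmonic mean of the two game counts.
    """
    num = den = 0.0
    for rw, rg, ow, og in seasons:
        if rg == 0 or og == 0:
            continue  # need games under both managers to compare
        w = min(rg, og) if weight == "min" else harmonic_mean([rg, og])
        num += w * (rw / rg - ow / og)
        den += w
    return num / den

# Hypothetical team-seasons, for illustration only:
seasons = [(15, 40, 55, 122), (30, 81, 45, 81), (3, 10, 70, 152)]
print(round(weighted_diff(seasons, "min"), 3))  # -0.15
```

The harmonic mean gives seasons with a lopsided split of games less influence than the minimum does, since it is pulled toward the smaller of the two counts.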

With that out of the way, I will present some data on managerial records above .420 for managers, 1901-2009. I'll call this Austin Rating in honor of Jimmy Austin, who is the only man to serve three such stints as manager (all with the Browns) without reaching 154 career games. Austin's player-manager career started with St. Louis in 1913, replacing George Stovall temporarily (2-6) before Branch Rickey took over permanently. He also did a stint in 1918 (7-9) in relief of Fielder Jones before Jimmy Burke stepped in. His final and longest experience at the helm was in 1923, when he was 22-29 replacing Lee Fohl. His career 31-44 mark (.413) is a little below the .420 baseline, so his own Austin Rating is -.5.
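In other words, the rating appears to be wins above what a .420 team would have over the same number of games. A one-line sketch (the function name is mine):

```python
def austin_rating(wins, losses, baseline=0.420):
    """Wins above a baseline W% over the same number of games."""
    return wins - baseline * (wins + losses)

# Jimmy Austin's career 31-44 mark:
print(round(austin_rating(31, 44), 1))  # -0.5
```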

Here are the top 25 career managers (again, through 2009):



There are sixteen Hall of Fame managers from this period; fourteen are in the top 25 for Austin Rating, with Whitey Herzog (270, 28th) and Wilbert Robinson (224, 34th) just missing the top 25. This is not offered as an indication that Austin Rating tracks HOF managerial choices or that it correctly identifies good managers, as any reasonable system based on career wins and losses would likely produce similar results for Hall of Fame skippers.

Going down the list, the non-Hall of Famers are either active or recently retired (Cox, LaRussa, Torre, Piniella) or in the Hall of Fame as a player (Clarke) until you get to Billy Southworth (Clark Griffith is also in the Hall, with a noteworthy career in the areas of playing, managing, and ownership). Southworth does not have wins in bulk (which seem to be the true indicator of HOF selection), but his .597 W% results in a very strong Austin Rating.

Here are the bottom ten managers:



Most of these guys served in the early part of the twentieth century, when competitive balance was less pronounced and multiple franchises had long walks in the wilderness. Prothro brings up the rear for managing the Phillies through three seasons of their dreadful pre-War stretch (1939-41). The only manager on the list that commanded over half of his games post-1950 was Roy Hartsfield, original skipper of the expansion Blue Jays. Extending the list down to 13th would include Alan Trammell, while Manny Acta ranks 18th lowest, but including 2010 would give him a slight bump as the Indians scraped over the .420 mark.

Finally, here is the leader in Austin Rating for each current team in their current city (except Washington which doesn't have much of a history; record with that franchise only):

Tuesday, June 16, 2020

Preoccupied With 1985: Linear Weights and the Historical Abstract

I stumbled across this unpublished post while cleaning up some files – it was not particularly timely when written about ten years ago, and is even less timely now. Unlike some other old pieces I find, though, I don’t know why I never published it, other than maybe redundancy and beating a dead horse. I still agree with the opinions I expressed, and it is well above the low bar required for inclusion on this blog.

The original edition of Bill James’ Historical Baseball Abstract, published in 1985, is my favorite baseball book, and I am far from the only well-read baseball aficionado who holds it in such high regard. It contains a very engaging walk through each decade in major league history, some interesting material on rating players (including what has to be one of the first explicit discussions of peak versus career value in those terms), ratings of the best players by position and the top 100 players overall, and career statistics for about 200 all-time greats which seem like nothing in the internet age but at the time represented the most comprehensive collation on those players.

However, there is one section of the book which does not hold up well at all. It really didn’t hold up at the time, but I wasn’t in a position to judge that. James reviews The Hidden Game of Baseball, published the previous year by John Thorn and Pete Palmer, and gives his thoughts about the Linear Weights system.

James’ lifelong aversion to linear weights is somewhat legendary among those of us who delve deeply into these issues, but the discussion in the Historical Abstract is the source of the river, at least in terms of James’ published material. For years, James’ thoughts colored the perception of linear weights by many consumers of sabermetric research. This is no longer the case, as many people interested in sabermetrics twenty-five years later have never read the original book, and linear weights have been rehabilitated and widely accepted through the work of Mitchel Lichtman, Tom Tango, and now many others.

So to go back thirty years later and rake James’ essay over the coals is admittedly unfair. You may choose to look at this as gratuitous James-bashing if you please; that is not my intent, but I won’t protest any further than this paragraph. I think that some of the arguments James advances against linear weights are still heard today in different words, and occasionally you will still see a reference to the article from an old Runs Created diehard. And if one can address the concerns of the Bill James of 1985 on linear weights, it should go a long way in addressing the concerns of other critics.

It should be noted that James on the whole is quite complimentary of The Hidden Game and its authors. I will be focusing on his critical comments on methodology, and so any excerpts I use will be of the argumentative variety and if taken without the disclaimer could give the wrong impression of James’ view of the work as a whole.

The first substantive argument that James offers against Palmer’s linear weights (in this case, really, the discussion is focused on the Batting Runs component) is their accuracy. The formula in question is:

BR = .46S + .80D + 1.02T + 1.40HR + .33(W + HB) + .3SB - .6CS - .25(AB - H) - .5(OOB)
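Transcribed directly into Python (a sketch; as discussed below, Palmer's out value is actually recalculated for each league-season, with -.25 only a representative figure):

```python
def batting_runs(S, D, T, HR, W, HB, SB, CS, AB, H, OOB, out_value=-0.25):
    """Palmer's Batting Runs: linear weights, runs above average."""
    return (0.46 * S + 0.80 * D + 1.02 * T + 1.40 * HR
            + 0.33 * (W + HB) + 0.3 * SB - 0.6 * CS
            + out_value * (AB - H) - 0.5 * OOB)
```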

As you know, Palmer’s formula uses an out value that returns an estimate of runs above average rather than absolute runs scored (in which case it would be somewhere around -.1). The formula listed by Palmer fixes the out value at -.25, but it is explained that the actual value is to be calculated for each league-season. James notes this, but then ignores it in using the Batting Runs formula to estimate team runs scored. To do so, he simply adds the above result to the league average of runs scored per team for the season. He opines that the resulting estimates “[do] not, in fact, meet any reasonable standard of accuracy as a predictor of runs scored.”

And it’s true--they don’t. This is not because the BR formula does not work, but rather because James applied it incorrectly. As he explains, “For the sake of clarity, the formula as it appears above yields the number of runs that the teams should be above or below the league average; when you add in the league average, as I did here, you should get the number of runs that they score.”

This seems reasonable enough, but in fact it is an incorrect application of the formula. The correct way to use a linear weights above average formula to estimate total runs scored is to add the result to the league average runs/out multiplied by the number of outs the team actually made.
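The two applications can be sketched side by side (function names are mine; any league and team totals fed in are for illustration):

```python
def naive_team_runs(raa, lg_avg_runs_per_team):
    """James' application: add the RAA estimate to the league average
    runs scored per team."""
    return raa + lg_avg_runs_per_team

def correct_team_runs(raa, team_outs, lg_runs, lg_outs):
    """Correct application: add league runs per out times the outs the
    team actually made, so teams that make more or fewer outs than
    average aren't misestimated."""
    return raa + (lg_runs / lg_outs) * team_outs
```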

This can be demonstrated pretty simply by using the same league-seasons (1983, both leagues) that James uses in the initial test in the Historical Abstract. If you use the BR formula using -.25 as the out weight and simply add the result to the league average runs scored (in each respective league), the RMSE is 29.5. Refine that a little bit by adding in the number of outs each team made multiplied by the respective league runs/out (but still using -.25 as the out weight), the RMSE improves to 29.3. The James formula that uses the most comparable input, stolen base RC, has a RMSE of 24.4, and you can see why (in this limited sample; I’m certainly not advocating paying much heed to accuracy tests based on one year of data, and neither was James) he thought BR was less accurate. But had he applied the formula properly, by figuring custom out values for each league (-.255 in the AL and -.244 in the NL) and adding the resulting RAA estimate to league runs/out times team outs, he would have gotten a RMSE of 18.7.

In fairness to James, the authors of The Hidden Game did not do a great job in explaining the intricacies of linear weight calculations. The book is largely non-technical, and nitty-gritty details are glossed over. The proper method to compute total runs scored from the RAA estimate is never exactly explained, nor is the precise way to calculate the out value specific to a league-season (while it’s a matter of simple algebra, presenting the formula explicitly would have cleared up some confusion). To do a fair accuracy test versus a method like Runs Created, which does not take into account any data on league averages, you would also need to calculate the -.1 out value over a large sample and hold it constant, which Thorn and Palmer did not do or explain. In addition, the accuracy test was not as well-designed as it could have been, although that wouldn’t have had much of an impact on the results for Batting Runs or Runs Created, but rather for rate stats converted to runs.
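The simple algebra in question: choose the out weight that forces league-total Batting Runs to zero. A sketch, with the event weights from the formula above (the function and argument names are mine, not Thorn and Palmer's):

```python
def league_out_value(S, D, T, HR, W, HB, SB, CS, AB, H, OOB):
    """Out weight that makes Batting Runs sum to zero for the league."""
    non_out_runs = (0.46 * S + 0.80 * D + 1.02 * T + 1.40 * HR
                    + 0.33 * (W + HB) + 0.3 * SB - 0.6 * CS - 0.5 * OOB)
    return -non_out_runs / (AB - H)
```

Fed the 1983 league totals, this should return values close to the -.255 (AL) and -.244 (NL) figures mentioned above.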

James then goes on to explain the advantage that Batting Runs has in terms of being able to hone in on the correct value for runs scored, since it is defined to be correct on the league level. He is absolutely correct (as discussed in the preceding paragraph) that this is an unfair advantage to bestow in a run estimator accuracy test; however, it is also demonstrable that even under a fair test, Batting Runs and other similar linear weight methods acquit themselves nicely and are more accurate than comparable contemporary versions of Runs Created.

In the course of this discussion, James writes “What I would say, of course, is that while baseball changes, it changes very slowly over a long period of time; the value of an out in the American League in 1987 will be virtually identical with the value of an out in the American League in 1988.” This turned out to be an unfortunate future example for James since the AL averaged 4.90 runs/game in 1987 but just 4.36 in 1988. James’ point has merit--values should not jump around wildly for no reason other than the need to minimize RMSE--but the Batting Runs out value does not generally behave in a manner inconsistent with simply tracking changes in league scoring.

James’ big conclusion on linear weights is: “I think that the system of evaluation by linear weights is not at all accurate to begin with, does not become any more accurate with the substitution of figures derived from one season’s worth of data…Linear weights cannot possibly evaluate offense for the simplest of reasons: Offense is not linear.”

He continues “The creation of runs is not a linear activity, in which each element of the offense has a given weight regardless of the situation, but rather a geometric activity, in which the value of each element is dependent on the other elements.” James is correct that offense is not linear and that the value of any given event is dependent on the frequency of other events. But his conclusion that linear weights are incapable of evaluating offense is only supported by his faulty interpretation of the accuracy of Batting Runs. While offense is not linear, team offense is restricted to a narrow enough range that linear methods can accurately estimate team runs scored.

More importantly, James fails to recognize that while offense is dynamic, a poor dynamic estimator (such as his own Runs Created) is not necessarily (and in fact, is not) going to perform better than a linear weight method at the task of estimating runs scored. He also does not consider the problems that might be inherent in applying a dynamic run estimator directly to an individual player’s batting line, when the player is in fact a member of a team rather than his own team. Eventually, he would come to this realization and begin using a theoretical team version of Runs Created (which is one of the many reasons this criticism of his thirty-five year old essay can be viewed as unfair).

Much of the misunderstanding probably could have been avoided had Batting Runs been presented as absolute runs rather than runs above average. Palmer has never used an absolute version in any of his books, but of course many others have used absolute linear weight methods. One of the more prominent is Paul Johnson’s Estimated Runs Produced, which was brought to the public eye when none other than Bill James published Johnson's article in the 1985 Abstract annual.

Johnson’s ERP formula was dressed up in a way that made it plain to see that it was linear, but did not explicitly show the coefficient for each event as Batting Runs did. Still, it remains almost inexplicable that an analyst of James’ caliber did not see the connection between the two approaches, as he was writing two very different opinions on the merits of each nearly simultaneously.

James also applies his broad brush to Palmer’s win estimation method, saying that if you ask the Pythagorean method “If a team scores 800 runs and allows 600, how many games will they win?”, it gives you an answer (104), while “the linear weights” says “ask me after the season is over.”

The use of the phrase “ask me after the season is over” is the kind of ill-conceived rhetoric that seems out of place in a James work but would be expected in a criticism of him by a clueless sportswriter. Any metric that compares to a baseline or includes anything other than the player’s own performance (such as a league average or a park factor) is going to see its output change as that independent input changes. That goes for many of James’ metrics as well (OW% for instance).

To the extent that the criticism has any validity, it should be used in the context of Batting Runs, since admittedly Palmer did not explain how to use linear weights to figure an absolute estimate of runs in the nature of Runs Created. To apply it to Palmer’s win estimator (RPW = 10*sqrt(runs per inning by both teams)) simply does not make sense. The win estimator does not rely on the league average; it accounts for the fact that each run is less valuable to a win as the total number of runs scored increases, but it doesn’t require the use of anything other than the actual statistics of the team and its opponents. (Of course, when applied to an individual player’s Batting Runs it does use the league average, which again is no different conceptually than many of James’ methods.) The Pythagorean formula with a fixed exponent has the benefit (compared to a linear estimator, even a dynamic one) of restricting W% to the range [0, 1], but it also treats all equal run ratios as translating to equal win ratios.
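Both estimators are easy to state in code. A sketch of the comparison James poses, with a fixed Pythagorean exponent of 2 (function names are mine):

```python
import math

def pythagorean_wpct(runs_scored, runs_allowed, exponent=2):
    """Fixed-exponent Pythagorean winning percentage."""
    rs, ra = runs_scored ** exponent, runs_allowed ** exponent
    return rs / (rs + ra)

def runs_per_win(runs_per_inning_both_teams):
    """Palmer's dynamic runs-per-win converter: 10 * sqrt(runs per
    inning by both teams). Note it needs no league average."""
    return 10 * math.sqrt(runs_per_inning_both_teams)

# James' example: score 800, allow 600 over a 162-game season
print(round(pythagorean_wpct(800, 600) * 162))  # 104
# At one combined run per inning, about ten runs buy a win
print(runs_per_win(1.0))  # 10.0
```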

James concludes his essay by comparing the offensive production of Luke Easter in 1950 and Jimmy Wynn in 1968. His methods show Easter creating 94 runs making 402 outs and Wynn creating 91 runs making 413 outs, while Batting Runs shows Easter as +29 runs and Wynn +26.

James goes on to point out that the league Easter played in averaged 5.04 runs per game, while Wynn’s league averaged 3.43, and thus Wynn was the far superior offensive player, by a margin of +37 to +18 runs using RC. “Same problem--the linear weights method does not adapt to the needs of the analysis, and thus does not produce an accurate portrayal of the subject.”

In this case, James simply missed the disclaimer that the out weight varies with each league-season. While it makes sense to criticize the treatment of the league average as a known in testing the accuracy of a run estimator, it doesn’t make any sense at all to criticize using it when putting a batter’s season into context. Of course, James agrees that context is important, as he converts Easter and Wynn’s RC into baselined metrics in the same discussion.

When Batting Runs is allowed to calculate its out value as intended, it produces a similar verdict on the value of Easter and Wynn. In Total Baseball (using a slightly different but very much same in spirit Batting Runs formula), Palmer estimates Wynn at +38 and Easter at +14, essentially in agreement with James’ estimate of +37 and +18. The concept of linear weights did not fail; James’ comprehension of it did. It doesn’t matter if that happened because Palmer and Thorn’s explanation wasn’t straightforward (or comprehensive) enough, or whether James just missed the boat, or a combination of both. Whatever the reason, the essay “Finding the Hidden Game, pt. 3” is not a fair or accurate assessment of the utility of linear weight methods and stands as the only real blemish on as good of a baseball book as has ever been written.

Monday, June 15, 2020

Tripod: Baselines

See the first paragraph of this post for an explanation of this series.

This essay will touch on the topics of various baselines and which are appropriate (in my opinion) for what you are trying to measure. In other words, it discusses things like replacement level. This is a topic that creates a lot of debate and acrimony among sabermetricians. A lot of this has to do with semantics, so all that follows is my opinion, some of it backed by facts and some of it just opinion.

Again, I cannot stress this enough; different baselines for different questions. When you want to know what baseline you want to use, first ask the question: what am I trying to measure?

Anyway, this discussion is kind of disjointed, so I'll just put up a heading for a topic and write on it.

Individual Winning Percentage

Usually the baseline is discussed in terms of a winning percentage. This unfortunate practice stems from Bill James' Offensive Winning Percentage. What is OW%? For instance, if Jason Giambi creates 11 runs per game in a context where the average team scores 5 runs per game, then Giambi's OW% is the W% you would expect when a team scores 11 runs and allows 5 (.829 when using a Pythagorean exponent of 2). It is important to note that OW% assumes that the team has average defense.
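
The OW% calculation can be sketched in a few lines (using the fixed Pythagorean exponent of 2 from the example above):

```python
def ow_pct(player_rg, league_rg, exponent=2):
    """Offensive W%: the W% of a team scoring the player's runs/game
    while allowing the league average runs/game (i.e., average defense)."""
    return player_rg ** exponent / (player_rg ** exponent + league_rg ** exponent)

# Giambi creating 11 runs per game in a 5 R/G league:
giambi = ow_pct(11, 5)   # -> ~.829
```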

So people will refer to a replacement level of say .333, and what they mean is that the player's net value should be calculated as the number of runs or wins he created above what a .333 player would have done. This gets very confusing when people try to frame the discussion of what the replacement level should be in terms of actual team W%s. They'll say something like, "the bottom 1% of teams have an average W% of .300, so let's make .300 the replacement level". That's fine, but the .300 team got its record from both its offense and defense. If the team had an OW% of .300 and a corresponding DW% of .300, their record would be about .155.
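
To see where the .155 figure comes from, here is a sketch that inverts the Pythagorean formula to turn each of the OW% and DW% into a run ratio and then combines them (a fixed exponent of 2 is assumed throughout):

```python
def w_to_run_ratio(w, exponent=2):
    """Invert the Pythagorean formula: the run ratio implied by a W%."""
    return (w / (1 - w)) ** (1 / exponent)

ow, dw = 0.300, 0.300
offense = w_to_run_ratio(ow)         # runs scored as a multiple of average
defense = 1 / w_to_run_ratio(dw)     # runs allowed as a multiple of average
team_ratio = offense / defense       # overall run ratio of the team
team_w = team_ratio ** 2 / (team_ratio ** 2 + 1)   # -> ~.155
```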

Confusing, eh? And part of that comes from the silly idea of putting a player's individual performance in the form of a team's W%. So, I prefer to define replacement level in terms of percentage of the league average the player performed at. It is much easier to deal with, and it just makes more sense. But I may use both interchangeably here since most people discuss this in terms of W%.

ADDED 12/04: You can safely skip this part and understand the rest of the article; it's really about a different subject anyway. I should note the weakness of the % of league approach. The impact of performing at 120% of the league is different at different levels of run scoring. The reason for this is that the % of league for a particular player is essentially a run ratio (like runs scored/runs allowed for a team). We are saying that said player creates 20% more runs than his counterpart, which we then translate into a W% by the Pythagorean as 1.2^2/(1.2^2+1)=.590. But as you can read in the "W% Estimator" article, the ideal exponent varies based on RPG. In a 10 RPG context (fairly normal), the ideal exponent is around 1.95. But in an 8 RPG context, it is around 1.83. So in the first case a 1.2 run ratio gives a .588 W%, but in the other it gives a .583. Now this is a fairly minor factor in most cases, but we want to be as precise as possible.

So from this you might determine that indeed the W% display method is ideal, but the W% approach serves to ruin the proportional relationship between various Run Ratios (with a Pyth exponent of 2, a 2 RR gives an .800 W%, while a 1 RR gives .500, but 2 is 2 times as high as 1, not .8/.5). So the ideal thing as far as I'm concerned is to use the % of league, but translate it into a win ratio by raising it to the proper Pythagorean exponent for the context (which can be figured approximately as RPG^.29). But this shouldn't have too big of an impact on the replacement level front. If you like the win ratio idea but want to convert it back into a run ratio, you can pick a "standard" league that you want to translate everybody back into (a la Clay Davenport). So if you want a league with a Pyth exponent of 2, take the square root of the win ratio to get the run ratio. Generally (W/L) = (R/RA)^x or (R/RA) = (W/L)^(1/x).
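
A sketch of the translation, with the exponent approximated as RPG^.29 (which matches the 1.95 and 1.83 exponents cited for the 10 and 8 RPG contexts):

```python
def pct_of_league_to_w(pct, rpg):
    """Treat % of league as a run ratio and convert it to a W% using a
    context-dependent Pythagorean exponent, approximated as RPG^.29."""
    x = rpg ** 0.29
    return pct ** x / (pct ** x + 1)

pct_of_league_to_w(1.2, 10)   # -> ~.588 in a 10 RPG context
pct_of_league_to_w(1.2, 8)    # -> ~.583 in an 8 RPG context
```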

Absolute Value

This is a good place to start. Why do we need a baseline in the first place? Why can't we just look at a player's Runs Created, and be done with it? Sabermetricians, I apologize, this will be quite patronizing for you.

Well, let's start by looking at a couple of players:

          H    D   T   HR   W
Player A  145  17  1   19   13
Player B  128  32  2   19   25

The first guy has 68 RC, the second guy has 69. But when you discover that Player A made 338 outs and Player B made 284 outs, the choice becomes pretty clear, no? BTW, player A is Randall Simon and player B is Ivan Rodriguez (2003).

But you could say that we should have known player B was better, because we could just look at his runs/out. But of course I could give you an example of 2 guys with .2 runs/out, but one made 100 outs and had 20 RC and another made 500 outs and had 100 RC. And so you see that there must be some kind of balance between the total and the rate.

The common sense way to do this is with a baseline. Some people, like a certain infamous SABR-L poster, will go to extreme lengths to attempt to combine the total and the rate in one number, using all sorts of illogical devices. A baseline is logical. It kills two or three birds with one stone. For one thing, we can incorporate both the total production and the rate of production. For another, we eventually want to evaluate the player against some sort of standard, and that standard can be the baseline that we use. And using a baseline automatically inserts an adjustment for league context.
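
As a hedged illustration of how a baseline combines the total and the rate, here is the Simon/Rodriguez comparison above evaluated against a baseline. Note that the league runs/out figure (.17) and the 73%-of-league baseline (roughly a .350 player with a Pythagorean exponent of 2) are assumed values for illustration, not the actual 2003 figures:

```python
LEAGUE_RO = 0.17                 # assumed league runs per out
BASELINE = 0.73 * LEAGUE_RO      # roughly a .350 player in W% terms

def runs_above_baseline(rc, outs, baseline=BASELINE):
    """Combine a player's total (RC) and rate (RC/out) against a baseline."""
    return (rc / outs - baseline) * outs

simon = runs_above_baseline(68, 338)       # player A
rodriguez = runs_above_baseline(69, 284)   # player B, well ahead despite
                                           # a nearly identical RC total
```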

There is value in every positive act done on a major league field. There is no way that you can provide negative absolute value. If you bat 500 times, make 499 outs and draw 1 walk, you have still contributed SOMETHING to your team. You have provided some value to that team.

But on the other hand, the team could have easily, for the minimum salary, found someone who could contribute far, far more than you could. So you have no value to the team in an economic sense. The team has no reason to pay you a cent, because they can find someone who can put up a .000/.002/.000 line panhandling on the street. This extreme example just goes to show why evaluating a major league player by the total amount of production he has put up is silly. That leads into the question: what is the level at which a team can easily find a player who can play that well?

Minimum Level

This is where a lot of analysts like to draw the baseline. They will find the level at which there are dozens of available AAA players who perform that well, and that is the line against which they evaluate players. Those players are numerous and therefore have no real value to a team. They can call up another one from AAA, or find one on waivers, or sign one out of the Atlantic League. Whatever.

There are a number of different ways of describing this, though. One is the "Freely Available Talent" level. That's sort of the economic argument I spelled out. But is it really free? This might be nitpicking, but I think it is important to remember that all teams spend a great deal of money on player development. If you give your first round pick a $2 million bonus and he turns out to be a "FAT" player, he wasn't really free. Of course, he is freely available to whoever might want to take him off your hands. But I like the analogy of, say, getting together with your friends, throwing your car keys in a box, and then picking one randomly and taking that car. If you put your Chevy Metro keys in there and draw out somebody's Ford Festiva keys, you didn't get anywhere. And while you now have the Festiva, it wasn't free. This is exactly what major league teams do when they pick each other's junk up. They have all poured money into developing the talent and have given up something to acquire it (namely their junk). None of this changes the fact that it is freely available or really provides any evidence against the FAT position at all; I just think it is important to remember that the talent may be free now, but it wasn't free before. Someone on FanHome proposed replacing FAT with Readily Available Talent or something like that, which makes some sense.

Another way people define this is the level at which a player can stay on a major league 25 man roster. There are many similar ways to describe it, and while there might be slight differences, they all are getting at the same underlying principle.

The most extensive study to establish what this line is was undertaken by Keith Woolner in the 2002 Baseball Prospectus. He determined that the minimum level was about equal to 80% of the league average, or approximately a .390 player. He, however, looked at all non-starters, producing a mishmash of bench players and true FAT players.

The basic idea behind all of these is that if a player fell off the face of the earth, his team would have to replace him, and the player who would replace him would be one of these readily available players. So it makes sense to compare the player to the player who would replace him in case of injury or other calamity.

A W% figure that is often associated with this line of reasoning is .350, although obviously there is no true answer and various other figures might give a better representation. But .350 has been established as a standard by methods like Equivalent Runs and Extrapolated Wins, and it is doubtful that it will be going anywhere any time soon.

Sustenance Level

This is kind of similar to the above. This is the idea that there is some level of minimum performance at which the team will no longer tolerate the player, and will replace him. This could be either his status on the roster or his status as a starting player (obviously, the second will produce a higher baseline in theory). You could also call this the "minimum sustainable performance" level.

Cliff Blau attempted a study to see when regular players lost their jobs based on their RG, at each position. While I have some issues with Blau's study, such as that it did not include league adjustments while covering some fairly different offensive contexts, his results are interesting nonetheless. He found no black line, no one level where teams threw in the towel. This really isn't that surprising, as there are a number of factors involved in whether or not a player keeps his job other than his offensive production (such as salary, previous production, potential, defensive contribution, nepotism, etc). But Bill James wrote in the 1985 Abstract that he expected there would be such a point. He was wrong, but we're all allowed to be sometimes.

Anyway, this idea makes sense. But a problem with it is that it is hard to pin down exactly where this line is, or for that matter, where the FAT line is. We don't have knowledge of a player's true ability, just a sample of varying size. The team might make decisions on who to replace based on a non-representative sample, or the sabermetrician might misjudge the talent of players in his study and thus misjudge the talent level. There are all sorts of selective sampling issues here. We also know that the major leagues are not comprised of the 750 best players in professional baseball. Maybe Rickie Weeks could hit better right now than the Brewers' utility infielder, but they want him to play every day in the minors. The point is, it is impossible to draw a firm baseline here. All of the approaches involve guesswork, as they must.

Some people have said we should define replacement level as the W% of the worst team in the league. Others have said it should be based on the worst teams in baseball over a period of years. Or maybe we should take out all of the starting players from the league and see what the performance level of the rest of them is. Anyway you do it, there's uncertainty, large potential for error, and a need to remember there's no firm line.

Average

But the uncertainty of the FAT or RAT or whatever baseline does leave people looking for something that is defined, and that is constant. And average fits that bill. The average player in the league always performs at a .500 level. The average team always has a .500 W%. So why not evaluate players based on their performance above what an average player would have done?

There are some points that can be made in favor of this approach. For one thing, the opponent that you play is on average a .500 opponent. If you are above .500, you will win more often than you lose. If you are below .500, you will lose more often than you win. The argument that a .500+ player is doing more to help his team win than his opponent is, while the .500- player is doing less to help his team win than his opponent is, makes for a very natural demarcation: win v. loss.

Furthermore, the .500 approach is inherently built into any method of evaluating players that relies on Run Expectancy or Win Expectancy, such as empirical Linear Weights formulas. If you calculate the run value of each event as the final RE value minus the initial RE value plus runs scored on the play (which is what empirical LW methods, and the value added approach, are doing), the average player will wind up at zero. Now the comparison to zero is not inevitable; you can fudge the formula or the results to compare to a non-.500 baseline, but initially the method is comparing to average.
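
A minimal sketch of that bookkeeping; the RE values below are invented for illustration, not an actual empirical table:

```python
# Hypothetical run expectancy values, keyed by (base state, outs)
RE = {
    ("empty", 0): 0.50,
    ("1st", 0): 0.90,
    ("empty", 1): 0.27,
}

def event_run_value(before, after, runs_scored):
    """Run value of a play: RE after the play, minus RE before, plus runs scored."""
    return RE[after] - RE[before] + runs_scored

single = event_run_value(("empty", 0), ("1st", 0), 0)    # a leadoff single: +0.40
out = event_run_value(("empty", 0), ("empty", 1), 0)     # a leadoff out: -0.23
```

Summed over every play in the league, these values net to zero by construction, which is why the method inherently compares each player to average.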

An argument that I have made on behalf of the average baseline is that, when looking back in hindsight on the season, the only thing that ultimately matters is whether or not the player helped you to win more than your opponent. An opponent of the average baseline might look at a .510 player with 50 PA and say that he is less valuable than a .490 player with 500 PA, since the team still had 450 additional PA with the first player. This is related to the "replacement paradox" which I will discuss later, but ignoring that issue for now, my argument back would be that it is really irrelevant, because the 450 PA were filled by someone, and there's no use crying over spilled milk. The .490 player still did less to help his team win than his opponent did to help his team win. It seems as if the minimum level is more of a forward looking thing, saying "If a team could choose between two players with these profiles, they would take the second one", which is surely true. But the fact remains that the first player contributed to wins more than his opponent. From a value perspective, I don't necessarily have to care about what might have happened, I can just focus on what did happen. It is similar to the debate about whether to use clutch hitting stats, or actual pitcher $H data, even when we know that these traits are not strongly repetitive from season to season. Many people, arguing for a literal value approach, will say that we should use actual hits allowed or a player's actual value added runs, but will insist on comparing the player to his hypothetical replacement. This is not a cut and dried issue, but it reminds us of why it is so important to clearly define what we are trying to measure and let the definition lead us to the methodology.

Average is also a comfortable baseline for some people to use because it is a very natural one. Everybody knows what an average is, and it is easy to determine what an average player's Batting Average or walk rate should be. Using a non-.500 baseline, some of this inherent sense is lost and it is not so easy to determine how a .350 player for instance should perform.

Finally, the most readily accessible player evaluation method, at least until recently, was Pete Palmer's Linear Weights system. In the catch-all stat of the system, Total Player Rating, he used an average baseline. I have heard some people say that in the later editions of Total Baseball, he justified it on the grounds that if you didn't play .500, you couldn't make the playoffs. However, in the final published edition, on page 540, he lays out a case for average. I will quote it extensively here since not many people have access to the book:

The translation from the various performance statistics into the wins or losses of TPR is accomplished by comparing each player to an average player at his position for that season in that league. While the use of the average player as the baseline in computing TPR may not seem intuitive to everyone, it is the best way to tell who is helping his team win games and who is costing his team wins. If a player is no better than his average counterparts on other teams, he is by definition not conferring any advantage on his team. Thus, while he may help his team win some individual games during the season--just as he will also help lose some individual games--over the course of a season or of a career, he isn't helping as much as his opponents are. Ultimately, a team full of worse-than-average players will lose more games than it wins.

The reason for using average performance as the standard is that it gives a truer picture of whether a player is helping or hurting his team. After all, almost every regular player is better than his replacement, and the members of the pool of replacement players available to a team are generally a lot worse than average regulars, for obvious reasons.

If Barry Bonds or Pedro Martinez is out of the lineup, the Giants or the Red Sox clearly don't have their equal waiting to substitute. The same is typically true for lesser mortals: when an average ballplayer cannot play, his team is not likely to have an average big-league regular sitting on the bench, ready to take his place.

Choosing replacement-level performance as the baseline for measuring TPR would not be unreasonable, but it wouldn't give a clear picture of how the contributions of each player translate into wins or losses. Compared to replacement-level performance, all regulars would look like winners. Similarly, when compared to a group of their peers, many reserve players would have positive values, even though they would still be losing games for their teams. Only the worst reserves would have negative values if replacement level were chosen as the baseline.

The crux of the problem is that a team composed of replacement-level players (which would by definition be neither plus nor minus in the aggregate if replacement-level is the baseline) would lose the great majority of its games! A team of players who were somewhat better than replacement level--but still worse than their corresponding average regulars--would lose more games than it won, even though the player values (compared to a replacement-level baseline) would all be positive.

Median

This is sort of related to the average school of thought. But these people will say that since the talent distribution in baseball is something like the far right hand portion of a bell curve, there are more below average players than above average players, and the superior performance of the above average players skews the mean. The average player may perform at .500, but if you were given the opportunity to take the #15 or #16 first baseman in baseball, he would actually be slightly below .500. So they would suggest that you cannot fault a player for being below average if he is in the top half of players in the game.

It makes some sense, but for one thing, the median in Major League Baseball is really not that dissimilar to the mean. A small study I did suggested that the median player performs at about 96% of the league mean in terms of run creation (approx. .480 in W% terms). It is almost a negligible difference. Maybe it is farther from the mean than that (as other studies have suggested), but either way, it just does not seem to me to be a worthwhile distinction, and most sabermetricians are sympathetic to the minimum baseline anyway, so few of them would be interested in a median baseline that really is not much different from the mean.

Progressive Minimum

The progressive minimum school of thought was first expressed by Rob Wood, while trying to reconcile the average position and the minimum position, and was later suggested independently by Tango Tiger and Nate Silver as well. This camp holds that if a player is injured and the team must scramble to find a .350 replacement, that does not bind them to using the .350 replacement forever. A true minimal level supporter wants us to compare Pete Rose, over his whole 20+ year career, to the player that would have replaced him had he been hurt at some point during that career. But if Pete Rose had been abducted by aliens in 1965, would the Reds have still been forced to have a .350 player in 1977? No. A team would either make a trade or free agent signing to improve, or the .350 player would become better and save his job, or a prospect would eventually come along to replace him.

Now the minimum level backer might object, saying that if you have to use resources to acquire a replacement, you are sacrificing potential improvement in other areas. This may be true to some extent, but every team at some point must sacrifice resources to improve themselves. It is not as if you can run a franchise solely on other people's trash. A team that tried to do this would eventually have no fans and would probably be repossessed by MLB. Even the Expos do not do this; they put money into their farm system, and it produced players like Guerrero and Vidro. They produced DeShields who they turned into Pedro Martinez. Every team has to make some moves to improve, so advocates of the progressive or time dependent baseline will say that it is silly to value a player based on an unrealistic representation of the way teams actually operate.

So how do we know how fast a team will improve from the original .350 replacement? Rob Wood and Tango looked at it from the perspective of an expansion team. Expansion teams start pretty much with freely available talent, but on average, they reach .500 in 8 years. So Tango developed a model to estimate the W% of such a team in year 1, 2, 3, etc. A player who plays for one year would be compared to .350, but his second year might be compared to .365, etc. The theory goes that the longer a player is around, the more chances his team has had to replace him with a better player. Eventually, a team will come up with a .500 player. After all, the average team, expending an average amount of resources, puts out a .500 team.
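
Here is a sketch of what a progressive baseline could look like. The exponential form and the rate constant are my assumptions for illustration, not Tango's actual model; the constants are simply chosen so that year one gives .350 and year two gives the .365 figure mentioned above:

```python
def progressive_baseline(year, start=0.350, limit=0.500, rate=0.10):
    """Baseline that starts at `start` in year 1 and closes a fixed fraction
    (`rate`) of the remaining gap to `limit` each subsequent year."""
    return limit - (limit - start) * (1 - rate) ** (year - 1)

progressive_baseline(1)   # .350 in year one
progressive_baseline(2)   # .365 in year two, rising toward .500 thereafter
```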

Another area you could go to from here is whether or not the baseline should ever rise above .500. This is something that I personally am very uneasy with, since I feel that any player who contributes more to winning than his opponent does should be given a positive number. But you could make the case that if a player plays for 15 years in the show, at some point he should have provided above average performance. This approach would lead to a curve for a player's career that would rise from .350, up over .500 maybe to say .550 at its peak, and then tailing back down to .350. Certainly an intriguing concept.

Silver went at it differently, by looking at players' offensive performance charted against their career PA. It made a logarithmic curve and he fitted a line to it. As PA increase, offensive production rapidly increases, but then the curve flattens out. Comparing Silver's work to Tango's, the baselines at various years were similar. It was encouraging to see similar results coming from two totally different and independent approaches.

A common argument against the progressive baseline is that even if you can eventually develop a .500 replacement, the presence of your current player does not inhibit the development of the replacement, so if your player does not get hurt or disappear, you could peddle the replacement to shore up another area, or use him as a backup, or something else. This is a good argument, but my counter might be that it is not just at one position where you will eventually develop average players; it is all over the diamond. The entire team is trending toward the mean (.500) at any given time, be it from .600 or from .320. Another potential counter to that argument is that some players can be acquired as free agent signings. Of course, these use up resources as well, just not human resources.

The best argument that I have seen against the progressive level is that if a team had a new .540 first baseman every year for 20 years, each would be evaluated against a .350 first baseman. But if a team had the same .540 first baseman for 20 years, he would be evaluated against a .350, then a .365, then a .385, etc., and would be rated as having less value than the total of the other team's 20 players, even though each team got the exact same amount of production out of their first base position. However, this just shows that the progressive approach might not make sense from a team perspective, but it does make sense from the perspective of an individual player's career. Depending on what we want to measure, we can use different baselines.

Chaining

This is the faction that I am most at home in, possibly because I published this idea on FanHome. I borrowed the term "chaining" from Brock Hanke. Writing on the topic of replacement level in the 1998 BBBA, he said something to the effect that when you lose your first baseman, you don't just lose him. You lose your best pinch hitter, who now has to man first base, and then he is replaced by some bum.

But this got me to thinking: if the team is replacing the first baseman with its top pinch hitter, who must be a better than minimum player or else he could easily be replaced, why must we compare the first baseman to the .350 player who now pinch hits? The pinch hitter might get 100 PA, but the first baseman gets 500 PA. So the actual effect on the team when the first baseman is lost is not that it gives 500 PA to a .350 player; no, instead it gives 500 PA to the .430 pinch hitter and 100 PA to the .350 player. And all of that dynamic is directly attributable to the first baseman himself. The actual baseline in that case should be something like .415.

The fundamental argument to back this up is that the player should be evaluated against the full scenario that would occur if he has to be replaced, not just the guy who takes his roster spot. Let's run through an example of chaining, with some numbers. Let's say that we have our starting first baseman who we'll call Ryan Klesko. We'll say Klesko has 550 PA, making 330 outs, and creates 110 runs. His backup racks up 100 PA, makes 65 outs, and creates 11 runs. Then we have a AAA player who will post a .310 OBA and create .135 runs/out, all in a league where the average is .18 runs/out. That makes Klesko a .775 player, his backup a .470 player, and the AAA guy a .360 player (ignoring defensive value and the fact that these guys are first basemen for the sake of example; we're also ignoring the effect of the individuals' OBAs on their PAs below; the effect might be slight but it is real and would serve to decrease the performance of the non-Klesko team). Now in this league, the FAT level is .135 R/O. So a minimalist would say that Klesko's value is (110/330-.135)*330 = +65.5 RAR. Or, alternatively, if the AAA player had taken Klesko's 550 PA (and it is the same thing as doing the RAR calculation), he would have 380 outs and 51 runs created.

Anyway, when Klesko and his backup are healthy, the team's first basemen have 650 PA, 395 outs, and 121 RC. But what happens if Klesko misses the season? His 550 PA will not go directly to the bum. The backup will assume Klesko's role and the bum will assume his. So the backup will now make 550/100*65=358 outs and create 11/65*358=61 runs. The bum will now bat 100 times, make 69 outs, and create .135*69=9 runs. So the team now has 427 outs and 70 RC from its first basemen. We lose 51 runs and gain 32 outs. But in the first scenario, with the bum replacing Klesko directly (which is what a calculation against the FAT line implicitly assumes), the team total would be 445 outs and 62 runs created. So the chaining subtracts 18 outs and adds 8 runs. Klesko's real replacement is the 70/427 scenario. That is .164 runs/out, or 91% of the league average, or a .450 player. That is Klesko's true replacement. A .450 player. A big difference from the .360 player the minimalists would assume.
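
The arithmetic above can be verified with a short script (same assumed inputs as in the example):

```python
LEAGUE_RO = 0.18   # league runs per out, as assumed in the example

def w_pct(run_ratio, exponent=2):
    """Pythagorean W% for a given run ratio."""
    return run_ratio ** exponent / (run_ratio ** exponent + 1)

# Backup scaled up to Klesko's 550 PA; bum scaled to the backup's 100 PA
backup_outs = 550 / 100 * 65          # ~358 outs
backup_rc = 11 / 65 * backup_outs     # ~61 runs created
bum_outs = 100 * (1 - 0.310)          # 69 outs at a .310 OBA
bum_rc = 0.135 * bum_outs             # ~9 runs created

chain_ro = (backup_rc + bum_rc) / (backup_outs + bum_outs)   # ~.164 runs/out
chain_wpct = w_pct(chain_ro / LEAGUE_RO)                     # ~.450, not .360
```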

But what happens if the backup goes down? Well, he is just replaced directly by the bum, and so his true replacement level is a .360 player. Now people will say that it is unfair for Klesko to be compared to .450 and his backup to be compared to .360. But from the value perspective, that is just the way it is. The replacement for a starting player is simply at a higher level than the replacement for a backup. This seems unfair, and it is a legitimate objection to chaining. But I suggest that it isn't that outlandish. For one thing, it seems to be the law of diminishing returns. Take the example of a team's run to win converter. The RPW suggested by Pythagorean is:

RPW = RD:G/(RR^x/(RR^x+1)-.5)

Where RD:G is run differential per game, RR is run ratio, and x is the exponent. We know that the exponent is approximately equal to RPG^.29. So a team that scores 5 runs per game and allows 4 runs per game has an RPW of 9.62. But what about a team that scores 5.5 and allows 4? Their RPW is 10.11.
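
Plugging the formula in (with the exponent approximated as RPG^.29, as above) reproduces both figures:

```python
def rpw(rs_per_g, ra_per_g):
    """Runs per win implied by the Pythagorean formula."""
    x = (rs_per_g + ra_per_g) ** 0.29          # context-dependent exponent
    rr = rs_per_g / ra_per_g                   # run ratio
    rd = rs_per_g - ra_per_g                   # run differential per game
    return rd / (rr ** x / (rr ** x + 1) - 0.5)

rpw(5, 4)     # -> ~9.62
rpw(5.5, 4)   # -> ~10.11
```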

So a team that scores .5 runs more than another is buying their wins at the cost of an additional .49 runs. This is somewhat similar to a starting player deriving value by being better than .450, and a backup deriving value by being better than .360. Diminishing returns. Now obviously, if your starter is .450, your backup must be less than that. So maybe the chained alternative should be tied to quality in the first place. Seems unfair again? Same principle. It's not something that we are used to considering in a player evaluation method, so it seems very weird, but the principle comes into play in other places(such as RPW) and we don't think of it as such because we are used to it.

Now an alternative way of addressing this is to point out the concept of different baselines for different purposes. A starting player, to keep his starting job, has a higher sustenance level than does a backup. Now since backups max out at say 200 PA, we could evaluate everyone's first 200 PA against the .360 level and their remaining PA against the .450 level. This may seem unfair, but I feel that it conforms to reality. A .400 player can help a team, but not if he gets 500 PA.

Some other objections to chaining will invariably come up. One objection is that not all teams have a backup to plug in at every position. Every team will invariably have a backup catcher, and somebody who can play some infield positions and some outfield positions, but maybe not on an everyday basis. And this is true. One solution might be to study the issue and find that say 65% of teams have a bench player capable of playing center field. So then the baseline for center field would be based 65% on chaining and 35% on just plugging the FAT player into the lineup. Or sometimes, more than one player will be hurt at once and the team will need a FAT player at one position. Another objection is that a player's position on the chain should not count against him. They will say that it is not the starter's fault that he has more room to be replaced under him. But really, it's not counting against him. This is the diminishing returns principle again. If he was not a starter, he would have less playing time, and would be able to accrue less value. And if you want to give "Klesko" credit for the value that his backup provides above the bum, fine. You are extending him the .360 baseline, but only over 100 PA, rather than the minimalist, who will extend it over all 550 of his PA. That is simply IMO not a realistic assessment of the scenario. All of these things just demonstrate that the baseline will not in the end be based solely on chaining; it would incorporate some of the FAT level as well.

When chaining first came up on FanHome, Tango did some studies of various things and determined that in fact, chaining coupled with adjusting for selective sampling could drive the baseline as high as 90%. I am not quite as gung ho, and I'm not sure that he still is, but I am certainly not convinced that he was wrong either.

Ultimately, it comes down to whether or not we are trying to model reality as best as possible or if we have an idealized idea of value. It is my opinion that chaining, incorporated at least somewhat in the baseline setting process, best models the reality of how major league teams adjust to loss in playing time. And without loss in playing time(actually variance in playing time), everyone would have equal opportunity and we wouldn't be having this darn discussion. Now I will be the first to admit that I do not have a firm handle on all the considerations and complexities that would go into designing a total evaluation around chaining. There are a lot of studies we would need to do to determine certain things. But I do feel that it must be incorporated into any effort to settle the baseline question for general player value.

Plus-.500 Baselines

If a .500 advocate can claim that the goal of baseball is to win games and that sub-.500 players contribute less to winning than do their opponents, couldn't someone argue that the real goal is to make the playoffs, and that this requires say a .560 W%, so shouldn't players be evaluated against .560?

I suppose you could make that argument. But to me at least, if a player does more to help his team win than his opponent does to help his team win, he should be given a positive rating. My opinion, however, will not do much to convince people of this.

A better argument is that the idea of winning pennants or making the playoffs is a separate question from just winning. Let's take a player who performs at .100 in one season and at .900 in another. The player will rate, by the .560 standard, as a negative. He has hurt his team in its quest to win pennants.

But winning a pennant is a seasonal activity. In the season in which our first player performed at .900, one of the very best seasons in the history of the game, he probably added 12 wins above average to his team. That would take an 81 win team up to 93 and put them right in the pennant hunt. He has had an ENORMOUS individual impact on his team's playoff hopes, similar to what Barry Bonds has done in recent years for the Giants.

So his team wins the pennant in the .900 season, and he hurts their chances in the second season. But is there a penalty in baseball for not winning the pennant? No, there is not. Finishing 1 game out of the wildcard chase is no better, from the playoff perspective, than finishing 30 games out. So if in his .100 season he drags an 81 win team down to 69 wins, so what? They probably weren't going to make the playoffs anyway.

As Bill James said in the Politics of Glory, "A pennant is a real thing, an object in itself; if you win it, it's forever." The .100 performance does not in any way detract from the pennant that the player provided by playing .900 in a different season.

And so pennant value is a totally different animal. To properly evaluate pennant value, an approach such as the one proposed by Michael Wolverton in the 2002 Baseball Prospectus is necessary. Using a baseline in the traditional sense simply will not work.

Negative Value/Replacement Paradox

This is a common area of misunderstanding. If we use the FAT baseline, and a player rates negatively, we can safely assume that he really does have negative value. Not negative ABSOLUTE value--nobody can have negative absolute value. But he does have negative value to a major league team, because EVERYBODY from the minimalist to the progressivists to the averagists to the chainists would agree that they could find, for nothing, a better player.

But if we use a different baseline(average in particular is used this way), a negative runs or wins above baseline figure does not mean that the player has negative value. It simply means that he has less value than the baseline he is being compared to. It does not mean that he should not be employed by a major league team.

People will say something like, ".500 proponents would have us believe that if all of the sub-.500 players in baseball retired today, there would be no change in the quality of play tomorrow". Absolute hogwash! An average baseline does not in any way mean that its proponents feel that a .490 player has no value, or that there is an infinite supply of .500 players the way there is of .350 players. It simply means that they choose to compare players to their opponents. It is a relative scale.

Even Bill James does not understand this or pretends not to understand this to promote his own method and discredit TPR(which uses a .500 baseline). For instance, in Win Shares, he writes: "Total Baseball tells us that Billy Herman was three times the player that Buddy Myer was." No, that's not what it's telling you. It's telling you that Herman had three times more value above his actual .500 opponent than did Myer. He writes "In a plus/minus system, below average players have no value." No, it tells you that below average players are less valuable than their opponent, and if you had a whole team of them you would lose more than you would win.

These same arguments could be turned against a .350 based system too. You could say that I rate at 0 WAR, since I never played in the majors, and that the system is saying that I am more valuable than Alvaro Espinoza. It's the exact same argument, and it's just as wrong going the other way as it is going this way.

And this naturally leads into something called the "replacement paradox". The replacement paradox is essentially that, using a .500 baseline, a .510 player with 10 PA will rate higher than a .499 player with 500 PA. And that is true. But the same is just as true at lower baselines. Advocates of the minimal baseline will often use the replacement paradox to attack a higher baseline, but the sword can be turned against them. They will say that nobody really cares about the relative ratings of .345 and .355 players. But hasn't a .345 player with 500 PA shown himself to have more ability than a .355 player with 10 PA? Yes, he has. Of course, on the other hand, he has also provided more evidence that he is a below average player as well. That kind of naturally leads into the idea of using the baseline to estimate a player's true ability. Some have suggested a close to .500 baseline for this purpose. Of course, the replacement paradox holds wherever you go from 0 to 1 on the scale. I digress; back to the replacement paradox as it pertains to the minimal level. While we may not care that much about how .345 players rate against .355 players, it is also true that we're not as sure exactly where that line really is as we are with the .500 line. How confident are we that it is .350 and not .330 or .370? And that uncertainty can wreak havoc with the ratings of players who for all we know could be above replacement level.

And now back to player ability; really, ability implies a rate stat. If there is an ability to stay healthy(some injuries undoubtedly occur primarily because of luck--what could Geoff Jenkins have done differently to avoid destroying his ankle, for instance), and there almost certainly is, then that is separate from a player's ability to perform when he is on the field. And performance when you are on the field is a rate, almost by definition. Now a player who has performed at a .600 level for 50 PA is certainly not equal to someone with a .600 performance over 5000 PA. What we need is some kind of confidence interval. Maybe we are 95% confident that the first player's true ability lies between .400 and .800, and we are 95% confident that the second player's true ability lies between .575 and .625. Which is a safer bet? The second. Which is more likely to be a bum? The first, by a huge margin. Which is more likely to be Babe Ruth? The first as well. Anyway, ability is a rate and does not need a baseline.
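A crude way to put numbers on the confidence-interval idea. This is my own sketch, using a normal approximation to the binomial and treating each PA as an independent trial, which is of course a simplification of real baseball performance:

```python
import math

def ability_interval(rate, pa, z=1.96):
    """Approximate 95% interval for a player's true rate, given an
    observed rate over `pa` trials (normal approximation)."""
    se = math.sqrt(rate * (1 - rate) / pa)
    return (rate - z * se, rate + z * se)

small = ability_interval(0.600, 50)    # wide: could be a bum or Babe Ruth
large = ability_interval(0.600, 5000)  # tight: a safe bet to be good
```

The 50 PA player's interval is ten times wider than the 5000 PA player's, which is the whole point: same observed rate, very different certainty.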

Win Shares and the Baseline

A whole new can of worms has been opened up by Bill James' Win Shares. Win Shares is very deceptive, because the hook is that a win share is 1/3 of an absolute win. Except it's not. It can't be.

Win Shares are divided between offense and defense based on marginal runs, which for the purpose of this debate, are runs above 50% of the league average. So the percentage of team marginal runs that come from offense is the percentage of team win shares that come from offense. Then each hitter gets "claim points" for their RC above a 50% player.

Anyway, if we figure these claim points for every player, some may come in below the 50% level. What does Bill James do? He zeroes them out. He is saying that the minimum level of value is 50%, but you do not have negative value against this standard, no matter how bad you are. Then, if you total up the positive claim points for the players on the team, the percentage of those belonging to our player is the percentage of offensive win shares they will get. The fundamental problem here is that he is divvying up absolute value based on marginal runs. The distortion may not be great, but it is there. Really, as David Smyth has pointed out, absolute wins are the product of absolute runs, and absolute losses are the product of absolute runs allowed. In other words, hitters don't make absolute losses, and pitchers don't make absolute wins. The proper measure for a hitter is therefore the number of absolute wins he contributes compared to some baseline. And this is exactly what every RAR or RAA formula does.
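A stripped-down sketch of the mechanism just described. This is my simplification for illustration, not James' full method; the league run environment (runs per out) and the 50% margin are parameters:

```python
def claim_points(rc, outs, lg_runs_per_out, margin=0.5):
    """Runs created above a hitter at `margin` of the league rate;
    negative totals are zeroed out, as James does."""
    return max(rc - margin * lg_runs_per_out * outs, 0.0)

def offensive_win_shares(players, team_off_ws, lg_runs_per_out):
    """Divide team offensive win shares in proportion to positive claim
    points -- the step that makes the result marginal, not absolute."""
    points = [claim_points(rc, outs, lg_runs_per_out) for rc, outs in players]
    total = sum(points)
    return [team_off_ws * p / total for p in points]

# Two good hitters and one sub-margin hitter: the third gets zero win
# shares despite creating 20 absolute runs.
shares = offensive_win_shares([(90, 400), (90, 400), (20, 400)], 100, 0.17)
```

The sub-margin hitter's absolute runs vanish from the accounting entirely, which is why the resulting totals cannot really be absolute wins.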

Now, to properly use Win Shares to measure value, you need to compare it to some baseline. Win Shares are incomplete without Loss Shares, or opportunity. Due to the convoluted nature of the method, I have no idea what the opportunity is. Win Shares are not really absolute wins. I think the best explanation is that they are wins above approximately .200, masquerading as absolute wins. They're a mess.

My Opinions

The heading here is a misnomer, because my opinions are littered all throughout this article. Anyway, I think that the use of "replacement level" to mean the FAT, or RAT, or .350 level, has become so pervasive in the sabermetric community that some people have decided they don't need to even defend it anymore, and can simply shoot down other baselines because they do not match. This being an issue of semantics and theory, and something you can't prove like a Runs Created method, you should always be open to new ideas and be able to make a rational defense of your position. Note that I am not accusing all of the proponents of the .350 replacement rate of being closed-minded; there are however some who don't wish to even consider other baselines.

A huge problem with comparing to the bottom barrel baseline is intelligibility. Let's just grant for the sake of discussion that comparing to .350 or .400 is the correct, proper way to go. Even so, what do I do if I have a player who is rated as +2 WAR? Is this a guy who I want to sign to a long term contract? Is he a guy who I should attempt to improve upon through trade, or is he a guy to build my team around?

Let's say he's a full time player with +2 WAR, making 400 outs. At 25 outs/game, he has 16 individual offensive games(following the IMO faulty OW% methodology as discussed at the beginning), and replacement level is .350, so his personal W% is .475. Everyone, even someone who wants to set the baseline at .500, would agree that this guy is a worthwhile player who can help a team, even in a starting job. He may contribute less to our team than his opponent does to his, but players who perform better than him are not that easy to find and would be costly to acquire.

So he's fine. But what if my whole lineup and starting rotation was made up of +2 WAR players? I would be a below average team at .475. So while every player in my lineup would have a positive rating, I wouldn't win anything with these guys. If I want to know if my team will be better than the team of my opponent, I'm going to need to figure out how many WAR an average player would have in 400 outs. He'd have +2.4, so I'm short by .4. Now I have two baselines. Except what is my new baseline? It's simply WAA. So to be able to know how I stack up against the competition, I need the average baseline as well.
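The arithmetic in the last two paragraphs can be wrapped up as follows (a sketch using the same assumptions: 25 outs per game, a .350 replacement level, and a .500 average):

```python
def personal_wpct(war, outs, repl=0.350, outs_per_game=25):
    """Individual W% implied by wins above replacement over `outs`."""
    games = outs / outs_per_game
    return repl + war / games

def wins_above_average(war, outs, repl=0.350, avg=0.500, outs_per_game=25):
    """Convert WAR to WAA by subtracting the WAR an average player
    would accrue over the same number of outs."""
    games = outs / outs_per_game
    return war - (avg - repl) * games

wpct = personal_wpct(2, 400)      # .475: useful, but below average
vs_avg = wins_above_average(2, 400)  # -0.4: short of the .500 team
```

The second function is the "second baseline" sneaking in: to know how the player stacks up against the competition, you end up computing WAA whether you meant to or not.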

That is not meant to be a refutation of value against the minimum; as I said, that theory of value could be completely true and the above would still hold. But to interpret the results, I'm going to need to introduce, whether intentionally or not, a higher baseline. Thus, I am making the case here that even if the minimum baseline properly describes value, other baselines have relevance and are useful as well.

As for me, I'm not convinced that it does properly define value. Why must I compare everybody to the worst player on the team? The worst player on the team is only there to suck up 100 PA or 50 mop up IP. A starting player who performs like the worst player is killing my team. And chaining shows that the true effect on my team if I was to lose my starter is not to simply insert this bum in his place in many cases. So why is my starter's value based on how much better he is than that guy?

Isn't value supposed to measure the player's actual effect on the team? So if the effect on my team losing a starter is that a .450 composite would take his place, why must I be compelled to compare to .350? It is ironic that as sabermetricians move more and more towards literal value methods like Win Expectancy Added and Value Added Runs or pseudo-literal value methods like Win Shares, which stress measuring what actually happened and crediting to the player, whether we can prove he has an ability to repeat this performance or not, they insist on baselining a player's value against what a hypothetical bottom of the barrel player would do and not by the baseline implied by the dynamics of a baseball team.

I did a little study of the 1990-1993 AL, classifying 9 starters and 4 backups for each team by their primary defensive position. Anyone who was not a starter or backup was classified as a FAT player. Let me point out now that there are numerous potential problems with this study, coming from selective sampling, and the fact that if a player gets hurt, his backup may wind up classified as the starter and the true starter as the backup, etc. There are many such issues with this study but I will use it as a basis for discussion and not as proof of anything.

The first interesting thing is to look at the total offensive performance for each group. Starters performed at 106.9 in terms of adjusted RG, or about a .530 player. Backups came in at 87.0, or about .430. And the leftovers, the "FAT" players, were at 73.8, or about .350.

So you can see where the .350 comes from. The players with the least playing time have an aggregate performance level of about .350. If you combine the totals for bench players and FAT players, properly weighted, you get 81.7, or about .400. And there you can see where someone like Woolner got 80%--the aggregate performance of non-starters.

Now, let's look at this in terms of a chaining sense. There are 4 bench players per team, and let's assume that each of them can play one position each on an everyday basis. We'll ignore DH, since anyone can be a DH. So 50% of the positions have a bench player, who performs at the 87% level, who can replace the starter, and the other half must be replaced by a 73.8% player. The average there is 80.4%, or about .390. So even using these somewhat conservative assumptions, and even if the true FAT level is .350, the comparison point for a starter is about .390, in terms of the player who can actually replace him.
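The blended baseline above is just a weighted average; a one-line sketch (the coverage fraction and performance levels are the study's numbers, which, again, should not be taken as gospel):

```python
def chained_baseline(bench_level, fat_level, bench_coverage):
    """Effective replacement level when `bench_coverage` of positions can
    be filled by a true bench player and the rest fall to FAT players."""
    return bench_coverage * bench_level + (1 - bench_coverage) * fat_level

# 4 bench players covering 8 non-DH positions -> 50% coverage
level = chained_baseline(0.870, 0.738, 4 / 8)  # 0.804 of league average
```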

Just to address the problems in this study again: one is that if the starter goes down, the backup becomes the starter and the FAT guy becomes the backup, or the backup goes down and the FAT guy moves up--the classification doesn't adjust for the changes that teams are forced to make, and that skews the results. Another is that if a player does not perform, he may be discarded. So there is selective sampling going on. A player might have the ability to perform at a .500 level, but if he plays at a .300 level for 75 PA, they're going to ship him out of there. This could especially affect the FAT players; they could easily be just as talented as the bench players, but hit poorly and that's why they didn't get enough PA to qualify as bench players.

The point was not to draw conclusions from the study. Anyway, why can't we rate players against three standards? We could compare(using the study numbers even though we know they are not truly correct) to the .530 level for the player's value as a starter; to .430 for the player's value as a bench player; and to .350 for a player's value to play in the major leagues at any time. Call this the "multi-tiered" approach. And that's another point. The .350 players don't really have jobs! They get jobs in emergencies, and don't get many PA anyway. The real level of who can keep a job, on an ideal 25 man roster, is that bench player. Now if the average bench player is .430, maybe the minimum level for a bench player is .400.

Anyway, what is wrong with having three different measures for three different areas of player worth? People want one standard on which to evaluate players, for the "general" question of value. Well why can't we measure the first 50 PAs against FAT, the next 150 against the bench, and the others against a starter? And you could say, "Well, if a .400 player is a starter, that's not his fault; he shouldn't have PAs 200-550 measured against a starter." Maybe, maybe not. If you want to know what this player has actually done for his team, you want to take out the negative value. But if you're not after literal value, you could just zero out negative performances.

A player who gets 150 PA and plays at a .475 level has in a way helped his team, relative to his opponent. Because while the opponent as a whole is .500, the comparable piece on the other team is .430. So he has value to his team versus his opponent; the opponent doesn't have a backup capable of a .475 performance. But if the .475 player gets 600 PA, he is hurting you relative to your opponent.

And finally, let's tie my chaining argument back in with the progressive argument. Why should a player's career value be compared to a player who is barely good enough to play in the majors for two months? If a player is around for ten years, he had better at some point during his career at least perform at the level of an average bench player.

Now the funny thing is that, as I am about to end this article, I am not going to tell you what baseline I would use if I was forced to choose one. I know for sure it wouldn't be .350. It would be something between .400 and .500. .400 on the conservative side, evaluating the raw data without taking its biases into account. .500 if I say "screw it all, I want literal value." But I think I will use .350, and publish RAR, because even though I don't think it is right, it seems to be the consensus choice among sabermetricians, many of whom are smarter than me. Is that a sell out? No. I'll gladly argue with them about it any time they want. But since I don't really know what it is, and I don't really know how to go about studying it to find it and control all the biases and pitfalls, I choose to err on the side of caution. EXTREME caution(got to get one last dig in at the .350 level).

I have added a spreadsheet that allows you to experiment with the "multi-tiered" approach. Remember, I am not endorsing it, but I am presenting it as an option and one that I personally think has some merit.

Let me first explain the multi-tiered approach as I have sketched it out in the spreadsheet. There are two "PA Thresholds". The first PA threshold is the "FAT" threshold; anyone with less PA than this is considered a FAT player. The second is the backup threshold; anyone with more PA than that is a starter. Anyone with a number of PA between the two thresholds is a backup.

So, these are represented as PA Level 1 and PA Level 2 in the spreadsheet, and are set at 50 and 200. Play around with those if you like.

Next to the PA Levels, there are percentages for FAT, backup, and starter. These are the levels at which value is considered to start for a player in each class, expressed as a percentage of league r/o. Experiment with these too; I have set the FAT at .73, the backup at .87, and the starter at 1.

Below there, you can enter the PA, O, and RC for various real or imagined players. "N" is the league average runs/game. The peach colored cells are the ones that do the calculations, so don't edit them.

RG is simply the player's runs per game figure. V1 is the player's value compared to the FAT baseline. V2 is the player's value compared to the backup baseline. V3 is the player's value compared to the starter baseline. v1 is the player's value against the FAT baseline, with a twist; only 50 PA count(or whatever you have set the first PA threshold to be). Say you have a player with 600 PA. The multi-tiered theory holds that he has value in being an above FAT player, but only until he reaches the first PA threshold. Past that, you should not have to play a FAT player, and he no longer has value. If the player has less than 50 PA, his actual PA are used.

v2 does the same for the backup. Only the number of PA between the two thresholds count, so with the default, there are a maximum of 150 PA evaluated against this level. If the player has less than the first threshold, he doesn't get evaluated here at all; he gets a 0.

v3 applies the same concept to the starter level. The number of PA that the player has over the second threshold are evaluated against the starter level. If he has less than the second threshold, he gets a 0.

SUM is the sum of v1, v2, and v3. This is one of the possible end results of the multi-tiered approach. comp is the baseline that the SUM is comparing against. A player's value in runs above baseline can be written as:

(RG - x*N)*O/25 = L

Where x is the baseline, 25 is the o/g default, and L is the value. We can solve for x even if the equation we use to find L does not explicitly use a baseline(as is the case here):

x = (RG-25*L/O)/N

So the comp is the effective baseline used by the multi-tiered approach. As you will see, this will vary radically, which is the point of the multi-tiered approach.

The +only SUM column is the sum of v1, v2, and v3, but only counting positive values. If a player has a negative value for all 3, his +only SUM is 0. If a player has a v1 of +3, a v2 of +1, and a v3 of -10, his +only SUM is +4, while his SUM would have been -6. The +only SUM does not penalize the player if he is used to an extent at which he no longer has value. It is another possible final value figure of the multi-tiered approach. The next column, comp, does the same thing as the other comp column, except this time it is based on the +only SUM rather than the SUM.
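For those who would rather read code than a spreadsheet, here is how I would sketch the whole calculation (same defaults: thresholds of 50 and 200 PA, levels of .73, .87, and 1.00; runs are prorated to the PA falling in each tier, which is one reasonable reading of the spreadsheet, not the only one):

```python
def multi_tier(pa, outs, rc, lg_rg, t1=50, t2=200,
               fat=0.73, backup=0.87, starter=1.00, outs_per_game=25):
    """Multi-tiered value: v1/v2/v3 against the FAT, backup, and starter
    levels, their SUM, the positive-only SUM, and the implied comp."""
    rg = rc * outs_per_game / outs       # player's runs per game
    games = outs / outs_per_game

    def tier(tier_pa, level):
        # runs above the tier's baseline, prorated to the tier's PA share
        return (rg - level * lg_rg) * games * tier_pa / pa

    v1 = tier(min(pa, t1), fat)
    v2 = tier(min(max(pa - t1, 0), t2 - t1), backup)
    v3 = tier(max(pa - t2, 0), starter)
    total = v1 + v2 + v3
    plus_only = sum(v for v in (v1, v2, v3) if v > 0)
    # solve (rg - x*lg_rg)*outs/25 = total for the effective baseline x
    comp = (rg - outs_per_game * total / outs) / lg_rg
    return {"v1": v1, "v2": v2, "v3": v3, "SUM": total,
            "+only SUM": plus_only, "comp": comp}

# a 600 PA starter creating 5 runs per game in a 4.5 R/G league
result = multi_tier(pa=600, outs=400, rc=80, lg_rg=4.5)
```

For a full-time player, the comp works out to the PA-weighted average of the three levels, here (50*.73 + 150*.87 + 400*1.00)/600 = .945--exactly the "varying baseline" behavior the approach is after.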

So play around with this spreadsheet. See what the multi-tiered approach yields and whether you think it is worth the time of day.

One of the objections that would be raised to the multi-tiered approach is that there is a different baseline for each player. I think this is the strength of it, of course. Think of it like progressive tax brackets. It is an uncouth analogy for me to use, but it helps explain it, and we're not confiscating property from these baseball players. So, let's just say that the lowest bracket starts at 25K at 10%, and then at 35K it jumps to 15%. So would you rather make 34K or 36K?

It's an absurd question, isn't it? At 34K, you will pay $900 in taxes, while at 36K you will pay $1150, but your net income is $33,100 against $34,850. Of course you want to make 36K, even if that bumps you into the next bracket.

The same goes for the multi-tiered value. Sure, a player who performs at 120% of the league average in 190 PA is having his performance "taxed" at 83%, and one who has 300 PA is being "taxed" at 89%. But you'd still rather have the guy in 300 PA, and you'd still rather make 36 grand.
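To check the "tax rate" figures, note that the effective comp level is just the PA-weighted average of the tier baselines. A quick sketch, using the same thresholds and levels as the spreadsheet defaults:

```python
def effective_baseline(pa, t1=50, t2=200, fat=0.73, backup=0.87, starter=1.00):
    """PA-weighted comp level under the tiered scheme -- the 'tax rate'."""
    fat_pa = min(pa, t1)
    bench_pa = min(max(pa - t1, 0), t2 - t1)
    starter_pa = max(pa - t2, 0)
    return (fat_pa * fat + bench_pa * backup + starter_pa * starter) / pa

rate_190 = effective_baseline(190)  # about .833
rate_300 = effective_baseline(300)  # .890
```

The 300 PA player faces the higher baseline, but a 120% hitter still banks more total value in 300 PA than in 190--he would still rather make the 36 grand.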

Now, just a brief word about the replacement level(s) used in the season statistics on this website. I have used the .350 standard although I personally think it is far too low. The reason I have done this is that I don't really have a good answer for what it should be(although I would think something in the .8-.9 region would be a reasonable compromise). Anyway, since I don't have a firm choice myself, I have decided to publish results based on the baseline chosen by a plurality of the sabermetric community. This way they are useful to the greatest number of people for what they want to look at. Besides, you can go into the spreadsheet and change the replacement levels to whatever the heck you want.

I am now using 73% (.350) for hitters, 125% (.390) for starters, and 111% (.450) for relievers. There is still a lot of room for discussion and debate, and I'm hardly confident that those are the best values to use.

Replacement Level Fielding

There are some systems, most notably Clay Davenport’s WARP, that include comparisons to replacement level fielders. I believe that these systems are incorrect, as are those that consider replacement level hitters; however, the distortions involved in the fielding case are much greater.

The premise of this belief is that players are chosen for their combined package of offense and defense, which shouldn't be controversial. Teams also recognize that hitting is more important, even for a shortstop, than is fielding. Even a brilliant defensive shortstop like Mario Mendoza or John McDonald doesn't get a lot of playing time when he creates runs at 60% of the league average. And guys who hit worse than that just don't wind up in the major leagues.

It also turns out at the major league level that hitting and fielding skill have a pretty small correlation, for those who play the same position. Obviously, in the population at large, people who are athletically gifted at hitting a baseball are going to carry over that talent to being gifted at fielding them. When you get to the major league level, though, you are dealing with elite athletes. My model (hardly groundbreaking) of the major league selection process is that you absolutely have to be able to hit at say 60% of the league average (excluding pitchers of course). If you can’t reach this level, no amount of fielding prowess is going to make up for your bat (see Mendoza example). Once you get over that hitting threshold, you are assigned to a position based on your fielding ability. If you can’t field, you play first base. Then, within that position, there is another hitting threshold that you have to meet (let’s just say it’s 88% of average for a first baseman).
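The selection model just described is a pair of thresholds. A toy version (the cutoffs .60 and .88 are the illustrative figures from the text, and the function itself is purely my own sketch):

```python
def roster_role(arg, good_glove, mlb_floor=0.60, first_base_floor=0.88):
    """Toy selection model: a hitting floor to be in the majors at all,
    then position assignment by glove, then a positional hitting floor."""
    if arg < mlb_floor:
        return "not a major leaguer"   # no glove makes up for this bat
    if good_glove:
        return "skill position"        # assigned by fielding ability
    return "first base" if arg >= first_base_floor else "fringe player"

glove_man = roster_role(0.62, good_glove=True)   # "skill position"
slugger = roster_role(0.95, good_glove=False)    # "first base"
```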

The bottom line is that there is no such thing as a replacement level hitter or a replacement level fielder. There is only a replacement level player. A replacement level player might be a decent hitter (90 ARG) with limited defensive ability (think the 2006 version of Travis Lee), who is only able to fill in sometimes at first base, or he might be a dreadful hitter (65 ARG) who can catch, and is someone's third catcher (any number of nondescript reserve catchers floating around baseball).

Thus, to compare a player to either a replacement level fielder or hitter is flawed; that's not how baseball teams pick players. Your total package has to be good enough; if you are a "replacement level fielder" who can't hit the ball out of the infield, you probably never even get a professional contract. If you are a "replacement level hitter" who fields like Dick Stuart, well, you'd be a heck of a softball player.

However, if you do compare to a “replacement level hitter” at a given position, you can get away with it. Why? Because, as we agreed above, all players are chosen primarily for their hitting ability. It is the dominant skill, and by further narrowing things down by looking at only those who play the same position, you can end up with a pretty decent model. Ideally, one would be able to assign each player a total value (hitting and fielding) versus league average, but the nature of defense (how do you compare a first baseman to a center fielder defensively?) makes it harder. Not impossible, just harder, and since you can get away fairly well with doing it the other way, a lot of people (myself included) choose to do so.

Of course, there are others that just ignore it. I saw a NL MVP analysis for 2007 just yesterday on a well-respected analytical blog (I am not going to name it because I don’t want to pick on them) that simply gave each player a hitting Runs Above Average figure and added it to a fielding RAA figure which was relative to an average fielder at the position. The result is that Hanley Ramirez got -15 runs for being a bad shortstop, while Prince Fielder got -5 runs for being a bad first baseman. Who believes that Prince Fielder is a more valuable defensive asset to a team than Hanley Ramirez? Anyone?

Comparing to a replacement level fielder as Davenport does is bad too, but it is often not obvious to people. I hope that my logic above has convinced you why it is a bad idea; now let's talk about the consequences of it. Davenport essentially says that a replacement level hitter is a .350 OW%, or 73 ARG hitter. This is uncontroversial and may be the most standard replacement level figure in use. But most people agreed upon this figure under the premise that it is an average defender who hits at that level. Davenport's system gives positive value above replacement to anyone who can hit at this level, even if they are a first baseman. Then, comparing to a replacement level fielder serves as the position adjustment. Shortstops have a lower replacement level than first basemen (or, since the formula is not actually published, it seems like this is the case), and so even Hanley picks up more FRAR than Prince. However, the overall replacement level is now much lower than .350.

So Davenport’s WARP cannot be directly compared to the RAR/WAR figures I publish, or even BP’s own VORP. If one wants to use a lower replacement level, they are free to do so, but since the point Davenport uses is so far out of line with the rest of the sabermetric community, it seems like some explanation and defense of his position would be helpful. Also, even if one accepts the need for a lower replacement level, it is not at all clear that the practice of splitting it into hitting and fielding is the best way to implement it.