Monday, August 13, 2007

Early NL Series: Intro and Run Estimation

My major interest in baseball research is theoretical sabermetrics. The “theoretical” label sounds a bit arrogant, but what I mean is that I am interested particularly in questions of what would happen at extremes that do not occur with the usual seasonal Major League data that many people analyze. For instance, RC works fine for normal teams, and so does 10 runs = 1 win as a rule of thumb. You don’t really need BsR or Pythagenpat for those types of situations--they can help sharpen your analysis, but you won’t go too far off track without them. Thus my interest in run and win estimation at the extremes, as well as evaluation of extreme batters (yes, I still have about five installments in the Rate Stat series to write, and yes, I will get around to it, but when, I don’t know). Secondary to that is using sabermetrics to increase my understanding of the baseball world around me (for example, how valuable is Chipper Jones? What are the odds that the Tigers win the World Series? Who got the better of the Brewers/Rangers trade?). I don't do this a whole lot here because there are dozens and dozens of people who do that kind of stuff, and I wouldn't be able to add any new insight. But a close third is using sabermetrics to evaluate the players and teams of the past. Particularly, I am interested in applying sabermetric analysis to the earliest days of what we now call major league baseball.

A few years ago, and again recently, I turned my attention to the National Association, the first loose major league of openly professional players that operated from 1871-1875. However, this league, as anyone who has attempted to statistically analyze it will know, was a mess. Teams played 40 games in a season; some dropped out after 10, some were horrifically bad, Boston dominated the league, etc. All of these factors make it difficult to develop the kind of sabermetric tools (run estimators, win estimators, baselines) that we use in present day analysis. So I finally threw my hands up and gave up (Dan Rosenheck came up with a BsR formula that worked better for the NA than anything I did, but there are limitations of the data that are hard to overcome). For now, it is probably best to eyeball the stats of NA players and teams and use common sense, as opposed to attempting to apply rigorous analytical structures to them.

Anyway, when things start to settle down, you have the National League, founded in 1876. I should note at this point that while I am interested in nineteenth-century baseball, I am by no means an expert on it, and so you should not be too surprised if I butcher the facts or make faulty assumptions, or call Cap Anson “Cap Anderson”. If you want a great historical presentation of old-time baseball, the best place to go is David Nemec’s The Great Encyclopedia of Nineteenth Century Major League Baseball. I believe that a revised edition of this book has been published recently, but I have the first edition. It is really a great book, similar in format to my favorite of the 20th century baseball encyclopedias, The Sports Encyclopedia: Baseball (or Neft/Cohen if you prefer). Like that work, only basic statistics are presented (no OPS+ or Pitching Runs, etc.), but you get the complete roster of each team each year, games by position, etc. And just like Neft/Cohen, there is a text summary of every season’s major stories, although Nemec writes these over the course of four or five pages, with pictures and trivial anecdotes, as opposed to the several paragraphs in the Neft/Cohen book. I wholeheartedly recommend the Nemec encyclopedia to anybody interested in the 19th century game.

That digression aside, the 1876 National League is still a different world than what we have today. The season is 60 games long, one team goes 9-56, pitchers are throwing from a box 45 feet away from the plate, it takes a zillion balls to draw a walk, overhand pitching is illegal, etc. But thankfully, you can make some sense of the statistics of this league, and while our tools don’t work as well, due to the competitive imbalance, the lack of important data that we have for later seasons, the smaller sample sizes as a result of a shorter season, etc., they can work to a level of precision that makes me comfortable to present their findings, with repeated caveats about how inaccurate they are compared to similar tools today. For the National Association, I could never reach that level of confidence.

What I intend to do over the course of this series is to look at the National League each season from 1876-1881. I chose 1881 for a couple of reasons, the first being that during those six seasons the NL had no other contenders to “major league” status (although many historians believe that other teams in other leagues would have been competitive with them--it's not like taking today’s Los Angeles Dodgers against the Vero Beach Dodgers). Also, in Bill James’ Historical Data Group Runs Created formulas, 1876-1881 is covered under one period (although 1882 and 1883 are included as well). That James found that he could put these seasons under one RC umbrella led me to believe that the same could be done for BsR and a LW method as well. I will begin by looking at the runs created methodology here.

Run estimation is a little tricky as you go back in time. Unfortunately, there is no play-by-play database that we can use to determine empirical linear weights, and some important data is missing (SB and CS particularly). The biggest missing piece of the offensive puzzle though is reached base on error, which for simplicity’s sake I will just refer to as errors from here on. In the 1880 NL, for instance, the fielding average was .901, and there were 8.67 fielding errors per game (for both teams). One hundred years later, the figures were .978 and 1.74. So you have something like five times as many errors being made as you do in the modern game.

When looking at modern statistics, you can ignore the error from an offensive perspective pretty safely. It will undoubtedly improve the accuracy of your run estimator if you can include it, but only very slightly, and the data is not widely available so we just ignore it, as we sometimes ignore sacrifice hits and hit batters and other minor events. But when there are as many errors as there were in the 1870s, you can’t ignore that. If you use a modern formula like ERP, and find the necessary multiplier, you will automatically inflate the value of all of the other events, because there has to be compensation somewhere for all of the runs being created as a result of errors.

So far as I know, there is only one published run estimator for this period. Bill James’ HDG-1 formula covers 1876-1883, and is figured as:
RC = (H + W)*(TB*1.2 + W*.26 + (AB-K)*.116)/(AB + W)
Bill decided to leave base runners as the modern estimate of H+W, and then try to account somewhat for errors by giving all balls in play extra advancement value. If you use the total offensive stats of the period to find the implicit linear weights, this is what you get:
RC = .730S + 1.066D + 1.402T + 1.739HR + .434W - .1081(AB - H - K) - .1406K
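
As a rough illustration, the two formulas above can be written out in Python. This is only a sketch: the function names and the team line are made up for the example and are not real 1870s data.

```python
def rc_hdg1(ab, h, tb, w, k):
    """Bill James' HDG-1 Runs Created, as quoted above."""
    on_base = h + w                                        # A factor
    advancement = tb * 1.2 + w * 0.26 + (ab - k) * 0.116   # B factor
    opportunities = ab + w                                 # C factor
    return on_base * advancement / opportunities


def rc_hdg1_linear(s, d, t, hr, w, ab, h, k):
    """Linearization of HDG-1 using the period-wide weights quoted above."""
    return (0.730 * s + 1.066 * d + 1.402 * t + 1.739 * hr + 0.434 * w
            - 0.1081 * (ab - h - k) - 0.1406 * k)


# Hypothetical team line (invented for illustration only).
team = dict(ab=2800, h=750, d=100, t=40, hr=10, w=80, k=200)
singles = team["h"] - team["d"] - team["t"] - team["hr"]
total_bases = singles + 2 * team["d"] + 3 * team["t"] + 4 * team["hr"]

rc = rc_hdg1(team["ab"], team["h"], total_bases, team["w"], team["k"])
rc_lin = rc_hdg1_linear(singles, team["d"], team["t"], team["hr"],
                        team["w"], team["ab"], team["h"], team["k"])
```

For this invented line both versions come out near 425 runs. The agreement would be looser for a team further from the period average, since the linear weights are only exact at the totals they were derived from.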

As you can see, the value of each event is inflated against our modern expectation of what they should be. I should note here that, of course, we don’t expect the 1870s weights to be the same as or even that similar to the modern weights. The coefficients do and should change as the game changes. That said, though, we have to be suspicious of a homer being valued at 1.74 runs and a triple at 1.40. The home run has a fairly constant value and it would take a very extreme context to lift its value so high. Scoring is high in this period (5.4 runs/game), but a lot of that logically has to be due to the extra errors. Three and a half extra errors per team game is like adding another 3.5 hits--it's going to be a factor in increased scoring.

To test RMSE for run estimators, I figured the error per (AB - H). I did this because I did not want the ever changing schedule length to unduly affect the RMSE. Of course, this does introduce the potential for problems because AB-H is much less a good proxy for outs in this period than it is today, as I will discuss shortly. I then multiplied the per out figure by 2153 (the average number of AB-H for a team in the 1876-1883 NL). In any case, doing this versus just taking the straight RMSE against actual runs scored did not make a big difference. Bill’s formula came in at 35.12 while the linearization was 30.65.
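
A minimal sketch of that scaled RMSE, assuming each team's error is divided by its own AB-H and then multiplied by 2153 before squaring. The function name and the sample numbers are invented for the example.

```python
import math

AVG_TEAM_OUTS = 2153  # average AB - H per team, 1876-1883 NL (from the text)


def scaled_rmse(teams, scale=AVG_TEAM_OUTS):
    """RMSE of run estimates, with each error put on a per-(AB - H) basis
    and then rescaled to an average team's AB - H.

    teams: sequence of (actual_runs, estimated_runs, ab_minus_h) tuples.
    """
    teams = list(teams)
    scaled_sq_errors = [
        ((estimated - actual) / outs * scale) ** 2
        for actual, estimated, outs in teams
    ]
    return math.sqrt(sum(scaled_sq_errors) / len(scaled_sq_errors))


# Made-up numbers: a short-season team and a team with twice the playing time.
example = [(300, 315, 1076.5), (600, 570, 2153)]
example_rmse = scaled_rmse(example)
```

In the example, a 15-run miss over half the playing time counts the same as a 30-run miss over a full average workload, which is the point of putting everything on a per-out basis.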

Of course what I wanted to do was figure out a Base Runs formula that worked for this period, as BsR is the most flexible and theoretically sound run estimator out there. What I decided to do was use Tango Tiger’s full modern formula and attempt to estimate some data that was missing and throw out other categories that would be much more difficult to estimate. I wound up estimating errors, sacrifice hits, wild pitches, and passed balls but throwing out steals, CS, intentional walks, hit batters, etc. Some of those events were subject to constantly changing rules and strategy (stolen bases and sacrifices were not initially a big part of the professional game) or didn’t even yet exist (Did teams issue intentional walks when it took 8 balls to give the batter first base? I am not a historian, but I doubt it. Hit batters did not result in a free pass until 1887 in the NL). In the end, I came up with these estimates:

ERRORS: In modern baseball, approximately 65% of all errors result in a reached base on error for the offense. I (potentially dubiously) assumed that a similar percentage held in the 1870s, and used 70%. Then I simply figured x as 70% of the league fielding errors, per out in play (AB-H-K). x was allowed to be a different value for each season. Some may object to this as it homes in too much on the individual year and I certainly can understand such a position. However, the error rates were fluctuating during this period. In 1876 the league FA was .866; in 1877 it was up to .884; then .893, .892, .901, .905, .897, and .891. These differences are big enough to suggest that fundamental changes in the game may have been occurring from year to year.
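
In code, the error estimate might look something like this. It is a sketch only; the function names and the league totals in the usage lines are hypothetical, chosen just to show the arithmetic.

```python
ROE_SHARE = 0.70  # assumed share of fielding errors that put the batter on base


def roe_rate(league_errors, league_ab, league_h, league_k, share=ROE_SHARE):
    """x = estimated reached-on-error per out in play (AB - H - K)."""
    outs_in_play = league_ab - league_h - league_k
    return share * league_errors / outs_in_play


def estimated_roe(ab, h, k, x):
    """Estimated times reached on error for a team or a player."""
    return x * (ab - h - k)


# Hypothetical league totals (not real data): 3000 errors, 15000 outs in play.
x_example = roe_rate(league_errors=3000, league_ab=22000,
                     league_h=6000, league_k=1000)

# Hypothetical team line, using a rate of .1345.
e_example = estimated_roe(ab=2800, h=750, k=200, x=0.1345)
```

Because x is computed from league totals, every batter in a given season gets the same reached-on-error rate per out in play.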

James’ method had no such yearly correction, and if you force the BsR formula I will present later to use a constant x value of .134 (i.e. 13.4% of outs in play resulted in ROE), its RMSE will actually be around a run and a half higher than that of the linearization of RC. I still think that there are plenty of good reasons to use the BsR formula instead, but in the interests of intellectual honesty, I did not want to omit that fact.

It is entirely possible that a better estimate for errors could be found; there is no reason to assume that every batter is equally likely to reach on an error once they’ve made an out in play. In fact, I am sure that some smart mind could come along and come up with better estimates than I have in a number of different areas, and blow my formula right out of the water. I welcome further inquiry into this by others and look forward to my formula being annihilated. So don’t take any of this as a finished product or some kind of divine truth (not that you should with my other work either).

SACRIFICES: The first league to record sacrifices, so far as I can tell, was the American Association in 1883 and 1884. In those leagues, there was .0323 and .0327 SH per single, walk, and estimated ROE. So I assumed SH = .0325*(S + W + E) would be an acceptable estimate in the early NL. NOTE: Wow, did I screw the pooch on this one. The AA DID NOT track sacrifices in '83 and '84. I somehow misread the HB column as SH. We do not have SH data until 1895 in the NL. So the discussion that follows is of questionable accuracy.

I did this some time ago without thinking it through completely; in early baseball, innovations were still coming quickly, and it is possible that in the seven-year interval, the sacrifice frequency changed wildly. George Wright recalled in 1915 (quoted in Bill James’ New Historical Baseball Abstract, pg. 10): “Batting was not done as scientifically in those days as now. The sacrifice hit was unthought of and the catcher was not required to have as good a throwing arm because no one had discovered the value of the stolen base.”

On the other hand, 1883 is pretty close to the end of our period, so while the frequency may well have increased over time, the estimate should at least be pretty good near the end of the line. One could also quibble with the choice of estimating sacrifices as a percentage of times on first base when, if sacrifices are not recorded, they are in actuality a subset of AB-H-K. Maybe an estimate based both on times on first and outs in play would work best. Again, there are a lot of judgment calls that go into constructing the formula, and so there are lots of areas for improvement.

WP and PB: These were kept by the NL, and there were .0355 WP per H+W-HR+E and .0775 PB per the same. So, the estimates are WP = .0355*(H + W - HR + E) and PB = .0775*(H + W - HR + E).
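
Put together, the three rate-based estimates could be sketched as follows. The function name is my own label, the input line is hypothetical, and the SH rate carries the caveat from the NOTE above.

```python
def estimate_missing_events(s, h, w, hr, e):
    """Estimate SH, WP and PB from the rates given above.

    s, h, w, hr are singles, hits, walks and home runs; e is the estimated
    number of times reached on error.
    """
    on_first = s + w + e            # S + W + E, the base for the SH estimate
    baserunners = h + w - hr + e    # H + W - HR + E, the base for WP and PB
    return {
        "SH": 0.0325 * on_first,
        "WP": 0.0355 * baserunners,
        "PB": 0.0775 * baserunners,
    }


# Hypothetical team line: 600 singles, 750 hits, 80 walks, 10 HR, 248.825 est. ROE.
est = estimate_missing_events(s=600, h=750, w=80, hr=10, e=248.825)
```

For this invented line the estimates work out to roughly 30 SH, 38 WP, and 83 PB.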

Then I simply plugged these estimates into Tango’s BsR formula. D of course was home runs, while A = H + W - HR + E + .08SH and C = AB - H - E + .92SH. The encouraging thing about this exercise was that the B factor only needed a multiplier of 1.087 (after including a penalty of .05 for outs) to predict the correct number of total runs scored. Ideally, if Base Runs was a perfect model of scoring (obviously it is not), we could use the same formula with any dataset, given all of the data, and not have to fudge the B component. The fact that we only had to fudge by 1.087 (compared to Bill James, who, to make his Basic RC work, had to add walks into the B factor, take 120% of total bases, and add 11.6% of balls in play to B) could indicate that the BsR formula holds fairly well for this time when we add important, more common events like SH, errors, WP, and PB. Of course, perhaps Bill could get similar results using a more technical RC formula + estimation. The bottom line is, a fudge of only 1.087 will keep the linear weights fairly close to what we expect today. I don’t know for sure that they should be, but I’d rather err on the side of our expectations as opposed to a potentially quixotic quest to produce the lowest possible RMSE for a sample of sixty teams playing an average of 78 games each.

So the B formula is:
B = (.726S + 1.948D + 3.134T + 1.694HR + .052W + .799E + .727SH + 1.165WP + 1.174PB - .05(AB - H - E))*1.087
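
Here is a sketch of the whole estimator in Python. One assumption to flag: the factors are combined as A*B/(B + C) + D, which is the standard Base Runs structure but is not spelled out in the text above. The function name and the team line are hypothetical, and E, SH, WP, and PB are passed in as already-estimated totals.

```python
def base_runs(ab, h, d, t, hr, w, e, sh, wp, pb):
    """Base Runs with the A, B, C and D factors described above.

    e, sh, wp and pb are estimated (not recorded) event totals.
    """
    s = h - d - t - hr
    a = h + w - hr + e + 0.08 * sh
    b = 1.087 * (0.726 * s + 1.948 * d + 3.134 * t + 1.694 * hr + 0.052 * w
                 + 0.799 * e + 0.727 * sh + 1.165 * wp + 1.174 * pb
                 - 0.05 * (ab - h - e))
    c = ab - h - e + 0.92 * sh
    d_factor = hr
    # Assumed: the standard Base Runs structure, A * B / (B + C) + D.
    return a * b / (b + c) + d_factor


# Hypothetical team line with made-up estimates for E, SH, WP and PB.
line = dict(ab=2800, h=750, d=100, t=40, hr=10, w=80,
            e=248.8, sh=30.2, wp=37.9, pb=82.8)
runs = base_runs(**line)

# Effect of one extra home run (one more AB, H and HR, everything else fixed).
plus_hr = dict(line, ab=line["ab"] + 1, h=line["h"] + 1, hr=line["hr"] + 1)
hr_value = base_runs(**plus_hr) - runs
```

For this invented line the estimate is about 421 runs, and the extra home run adds roughly 1.4 runs. Adding one event and re-running the formula like this is one common way to read linear weights off a non-linear estimator.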

The RMSE of this formula by the standard above is 28.18. I got as low as 24.61 by increasing the outs weight to -.2, but I was not comfortable with the ramifications of this. As mentioned before, if one does not allow each year to have a unique ROE per OIP ratio, the RMSE is a much worse 32.20. Again, I feel a different yearly factor is appropriate, but can certainly see if some feel this is an unfair advantage for this estimator when comparing it to others. The error of approximately 30 runs is a far cry from the errors around 23 in modern baseball, plus the season was shorter and the teams in this period averaged only 421 runs/season, so the raw number makes it seem smaller than it actually is. As I said before, you should always be aware of the inaccuracies when using any sabermetric method, but those caveats are even more important to keep in mind here.

Another way to consider the error is as a percentage of the runs scored by the team. This is figured as ABS(R-RC)/R. For sake of comparison, basic ERP, when used on all teams 1961-2002 (except 1981 and 1994), has an average absolute error of 2.7%. The BsR formula here, applied to all NL teams 1876-1883, has an AAE of 5.4%, twice that value. So once again I will stress that the methods used here are nowhere near as accurate as the similar methods used in our own time. Just for kicks, the largest error is a whopping 24.2% for the 1876 Cincinnati entry, which scored 238 runs but was projected to score 296. The best estimate is for Buffalo in 1882; they actually scored 500 versus a prediction of 501.
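
The percentage error is simple enough to sketch directly. The function names are my own; the two calls use the rounded run totals quoted above.

```python
def abs_pct_error(actual_runs, estimated_runs):
    """Absolute error as a share of actual runs: ABS(R - RC) / R."""
    return abs(actual_runs - estimated_runs) / actual_runs


def average_abs_pct_error(pairs):
    """Mean of ABS(R - RC) / R over (actual, estimated) pairs."""
    pairs = list(pairs)
    return sum(abs_pct_error(r, rc) for r, rc in pairs) / len(pairs)


# The two extreme teams quoted above, using the rounded run totals.
cincinnati_1876 = abs_pct_error(238, 296)
buffalo_1882 = abs_pct_error(500, 501)
```

With the rounded totals, Cincinnati comes out at about 24.4% rather than the 24.2% quoted, a gap that is presumably due to the projection having been rounded to 296 for presentation. Buffalo comes out at 0.2%.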

Before I move on too far, I have a little example that will illustrate the enormous effect of errors in this time and place. In modern baseball, there are pretty much exactly 27 outs per game, and approximately 25.2 of these are AB-H. We recognize, of course, that ROE in our own time are included in this batting out figure, and should not be, but any distortion is small and can basically be ignored.

Picking a random year, in the 1879 NL, we know that there were 27.09 outs/game since we have the innings pitched figure. How many batting outs were there per game? Well, if the modern rule of thumb held, there should be just about 25.2. There were 28.01. So there are more batting outs per game than there are total outs in the game. With our error estimate subtracted (so that batting outs = AB - H - E), we estimate 24.60. Now this may well be too low, or just right, or what have you. Maybe it should have been 50% of errors that put a runner on first base instead of 70%. I don’t know. What I do know is that if you pretend errors do not exist, you are going to throw all of your measures for this time and place out of whack. Errors were too big of a factor in the game to just be ignored as we can do today.

Let’s take a look at the linear values produced by the Base Runs formula, as applied to the entire period:
Runs = .551S + .843D + 1.126T + 1.404HR + .390W + .569E + .081SH + .280PB + .278WP - .145(AB - H - E)

This is why I felt much more comfortable with the BsR formula I chose, despite the fact that there were versions with better accuracy. These weights would not be completely off-base if we found them for modern baseball. Whether or not they are the best weights for 1876-1883, we will have to wait for when brighter minds tackle the problem or when PBP data is available and we can empirically see what they are. But to me, it is preferable to accept greater error in team seasonal data but keep our common sense knowledge of what events are worth rather than to chase greater accuracy but distort the weights.

This is still not the formula that I am going to apply to players, though. For that, I will use the linear version for that particular season. Additionally, for players, SH, PB, and WP will be broken back down into their components. What I mean is that we estimate that a SH is worth .081 runs, and we estimated that there are .0325 SH for every S, W, and E. .081*.0325 = .0026, and therefore, for every single, walk, and error we’ll add an additional .0026 runs. So a single will be worth .551+.0026 = .554 runs. We’ll also distribute the PB and WP in a similar way.

There are some drawbacks to doing it this way. If Ross Barnes hits 100 singles, his team may in fact lay down 3.25 more sacrifices. But it will be his teammates doing the sacrificing, not him. And we would assume that good hitters would sacrifice less than poor hitters, and this method assumes they are all doing it equally.

On the other hand, though, we are just doing something similar in spirit to what a theoretical team approach does--crediting the change in the team’s stats as a direct result of the player to the player. Besides, there’s really no other fair way to do it (we don’t want to get into estimating SH as a function of individual stats, and even if we did, we have no individual SH data for this period to test against). Also, in the end, the extra weight added to each event will be fairly small, and I am much more comfortable doing it with the battery errors which should be fairly randomly distributed with regards to which particular player is on base when they occur.

Then there is the matter of the error. Since the error is done solely as a function of AB-H-K, we could redistribute it, and come up with a different value for a non-K out and a K out, and write errors out of the formula, and have a mathematically equivalent result. However, I am not going to do this because I believe that, as covered previously, errors are such an important part of this game that we should recognize them, and maybe even include them in On Base Average (I have not in my presentation here, but I wouldn’t object if someone did) in order to remember that they are there. I think that keeping errors in the formula gives a truer picture of the linear weight value of each event as well, as it allows us to remember that the error is worth a certain number of runs and that outs, actual outs, have a particular negative value. Hiding this by lowering the value of an out seems to erase information to me.

I mentioned earlier that each year will have a different x to estimate errors in the formula x(AB-H-K). They are: 1876 = .1531, 1877 = .1407, 1878 = .1368, 1879 = .1345, 1880 = .1256, 1881 = .1184.

At this point, let me present the weights for the league as a whole in each year in 1876-1881, and then the ones with SH, PB, and WP stripped out and reapportioned across the other events. The first set is presented as (S, D, T, HR, W, E, AB-H-E, SH, PB, WP). The second is presented as (S, D, T, HR, W, E, AB-H-E).

1876: .552, .853, 1.146, 1.417, .386, .570, -.147, .085, .289, .287
1876: .588, .886, 1.178, 1.417, .422, .606, -.147
1877: .563, .862, 1.152, 1.414, .398, .581, -.153, .079, .287, .285
1877: .598, .894, 1.184, 1.414, .433, .616, -.153
1878: .546, .846, 1.138, 1.417, .380, .564, -.144, .087, .289, .287
1878: .581, .879, 1.171, 1.417, .415, .599, -.144
1879: .543, .830, 1.108, 1.397, .385, .560, -.140, .082, .275, .273
1879: .577, .861, 1.139, 1.397, .419, .594, -.140
1880: .537, .825, 1.105, 1.400, .378, .554, -.137, .086, .277, .275
1880: .571, .856, 1.136, 1.400, .412, .588, -.137
1881: .560, .859, 1.149, 1.415, .395, .578, -.151, .080, .287, .285
1881: .595, .891, 1.182, 1.415, .430, .613, -.151
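
The reapportioning can be checked against the 1876 rows with a short sketch. The split used here is my reading of the estimates given earlier: singles, walks, and errors pick up the SH, PB, and WP shares; doubles and triples pick up only the PB and WP shares; home runs and outs are left alone. The names are made up for the example.

```python
SH_PER_ON_FIRST = 0.0325     # SH per S + W + E
PB_PER_BASERUNNER = 0.0775   # PB per H + W - HR + E
WP_PER_BASERUNNER = 0.0355   # WP per H + W - HR + E


def reapportion(weights):
    """Fold the SH, PB and WP weights back into the other events."""
    sh_share = weights["SH"] * SH_PER_ON_FIRST
    battery_share = (weights["PB"] * PB_PER_BASERUNNER
                     + weights["WP"] * WP_PER_BASERUNNER)
    return {
        "S": weights["S"] + sh_share + battery_share,
        "D": weights["D"] + battery_share,
        "T": weights["T"] + battery_share,
        "HR": weights["HR"],   # HR is excluded from H + W - HR + E
        "W": weights["W"] + sh_share + battery_share,
        "E": weights["E"] + sh_share + battery_share,
        "OUT": weights["OUT"],
    }


full_1876 = {"S": .552, "D": .853, "T": 1.146, "HR": 1.417, "W": .386,
             "E": .570, "OUT": -.147, "SH": .085, "PB": .289, "WP": .287}
listed_1876 = {"S": .588, "D": .886, "T": 1.178, "HR": 1.417, "W": .422,
               "E": .606, "OUT": -.147}
folded_1876 = reapportion(full_1876)
max_gap = max(abs(folded_1876[k] - listed_1876[k]) for k in listed_1876)
```

The recomputed 1876 weights agree with the listed second row to within about .001, which is roughly what you would expect from working with inputs rounded to three decimals.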

Next installment, I’ll talk a little bit about replacement level, the defensive spectrum, and park factors.


  1. An excellent series of articles I must say. I noticed with BsR that it's pretty accurate dating back to World War II. It's interesting that this coincides with a .005 gain in Fielding Percentage and a 10% increase in strikeouts after WWII.

  2. Thanks. One of the things I intend to get done eventually is BsR formulas for 1883-1953 or so. More errors is certainly one of the things that can throw any run estimator off.

