Wednesday, December 16, 2020

Akousmatikoi Win Estimators, pt. 1: Pythagorean

This series will be a brief review of the Pythagorean methodology for estimating team winning percentage from runs scored and runs allowed, and will examine a number of alternative winning percentage estimators that can be derived from the standard Pythagorean approach. I call it a “review” because I will not be presenting any new methods – in fact, not only was everything I plan to cover discovered and published by other sabermetricians, but it is all material that I have already written about in one form or another. When recently posting old articles from my Tripod site, I saw how poorly organized the section on win estimators was, and decided that I should try to write a cleaner version that focuses on the relationship between the Pythagorean approach and other mathematical forms for win estimators. This series will start from the assumption that Pythagorean is a useful model; I don’t think this is a controversial claim but a full treatment would need to establish that before jumping into mathematical offshoots.

By christening his win estimator the “Pythagorean Theorem” due to the three squared terms in the formula reminding him of the three squared terms Pythagoras discovered defined the dimensions of right triangles, Bill James made it irresistible for future writers to double down with even more ridiculous names. I am sure any students of Greek philosophy are cursing me, but I am calling this the “Akousmatikoi” family of win estimators because Wikipedia informs me that Akousmatikoi was a philosophical school that was a branch of the larger school of Pythagoreanism based on the teachings of Pythagoras. A rival branch, the Mathematikoi school, was more focused on the intellectual and mathematical aspects of Pythagorean thought, which would make it a better name for my purposes, but even I think that sounds too ridiculous. I’ve also jumbled the analogy as James’ Pythagorean theorem is the starting point for the Akousmatikoi family of estimators but Pythagoras begat this school of philosophy, but not the other way around. Of course, James’ Pythagorean theorem really has nothing to do with Pythagoras to begin with, so don’t think too hard about this.

Before I get started, I want to make certain that I am very clear that I’m introducing nothing new and that while I will derive a number of methods from Pythagorean, the people who originally discovered and published these methods used their own thought processes and ingenuity to do so. They did not simply derive them from Pythagorean. I will try to namecheck them throughout the series, but will also do it here in case I slip up – among the sabermetricians who developed the methods that I will treat as Pythagorean offshoots independently are Bill Kross, Ralph Caola, and Ben Vollmayr-Lee.

I also want to briefly address the win estimators that are in common use that are not part of what I am calling the Akousmatikoi family. The chief one that I use is Cigol, which is my implementation of a methodology that starts with an assumed run distribution per game and calculates W% from there (I say “calculates” rather than “estimates” because given the assumptions about per game and per inning run distribution functions, it is a logical mathematical derivation, not an estimate. Of course, the assumptions are just that). Cigol is very consistent with the results of Pythagenpat for teams across a wide range of scoring environments, but is its own animal. There are also approaches based on regression that offer non-Akousmatikoi paths to win estimates. If you regress on run differential or run ratio, your results will look similar to Akousmatikoi, but if you take the path of Arnold Soolman’s pioneering work and regress runs and runs allowed separately, or you use logistic regression or another non-linear methodology, your results won’t be as easily relatable to the Akousmatikoi methods.

It all starts with Pythagorean, which Bill James originally formulated as:

W% = R^2/(R^2 + RA^2)

The presence of three squared terms reminded James of the real Pythagorean theorem for the lengths of the side of right triangle (A^2 = B^2 + C^2) and gave us the charmingly wacky name for this method of win estimation. James would later complicate matters by noting that a lower exponent resulted in a slight increase in accuracy:

W% = R^1.83/(R^1.83 + RA^1.83)

Later research by Clay Davenport and Keith Woolner would demonstrate that a custom exponent, varying by run environment, would result in better accuracy in extreme situations. Pete Palmer had long before demonstrated that his linear methods increased in accuracy when considering run environment; “Pythagenport” brought this insight to Pythagorean, which we’ll now more generally express as:

W% = R^x/(R^x + RA^x)

Where Pythagenport estimates x = 1.5*log(RPG) + .45, where RPG = (R + RA)/G

Davenport and Woolner stated that the accuracy of Pythagenport was untested for RPG less than 4. A couple years later, David Smyth had the insight that 1 RPG was a situation that could only occur if the score of each game was 1-0, and that such a team’s W% would by definition be equal to R/(R + RA). This implies that the Pythagorean exponent must be 1 when RPG = 1. Based on this insight, Smyth and I independently developed a modified exponent which was constructed as:

x = RPG^z

where z is a constant generally in the range of .27 - .29 (I originally published as .29 and have tended to use this value out of habit, although if you forced me to pick one value and stick to it I’d probably choose .28)

This approach produced very similar results to Pythagenport for the RPG ranges tested by Davenport and Woolner, and returned the correct result for the known case at RPG = 1. It has come to be called “Pythagenpat”.

Using Cigol, I tried to develop a refined formula for Pythagorean exponent using data for truly extreme temas. I loosened the restriction on requiring x = 1 when RPG = 1 to be able to consider a wider range of models, but I wasn’t able to come up with a version that produced superior accuracy with a large dataset of actual major league team-seasons to the standard Pythagenpat construction. My favorite of the versions I came up are below, which I won’t dwell on any longer but will revisit briefly at the end of the series. The first is a Pythagenpat exponent that produces a Pythagorean exponent of 1 at 1 RPG; the second is a Pythagorean exponent that does not adhere to that restriction.

z = .27348 + .00025*RPG + .00020*(R - RA)^2

x = 1.03841*RPG^.265 + .00114*RD^2

There are several properties of a Pythagorean construct that make it better suited as a starting point (standing in for the “true” W% function, if there could ever be such a thing) than some of the other methods we’ll look at. I have previously proposed a list of three ideal properties of a W% estimator:

1. The estimate should fall in the range [0,1]
2. The formula should recognize that the marginal value of runs is variable.
3. The formula should recognize that as more runs are scored, the number of marginal runs needed to earn a win increases.

As we move throughout this series, we will make changes to simplify the Pythagenpat function in some ways; in my notes I called it “flattening”, but that’s not a technical term. Basically, where we see exponents, we will try to convert into multiplication, or we will try to use run differential in place of run ratio. As we “flatten” the functions out, we will progressively lose some of these ideal properties, with the (usual) benefit of having simpler functions.

Throughout this series I will make sporadic use of the team seasonal data for the expansion era (1961 – 2019), so at this point I want to use this dataset to define the Pythagorean constants that we’ll use going forward. Rather than using any formulaic approach, I am going to fix x and z for this period by minimizing the RMSE of the W% estimates for the teams in the dataset. I will also use the fixed Pythagorean exponent of 2 throughout the series as it is easy to calculate, reasonably accurate, widely used, and mathematically will produce some pleasing results for the other Akousmatikoi estimators.

Using this data, the average RPG is 8.83, the value for x that minimizes RMSE is 1.847, and the z value that minimizes RMSE is .282. Note that if we used the average RPG to estimate the average Pythagorean exponent, we’d get 1.848 (8.83^.282), which doesn’t prove anything but at least it’s not way off.

Thursday, December 03, 2020

Palmerian Park Factors

The first sabermetrician to publish extensive work on park effects was Pete Palmer. His park factors appeared in The Hidden Game of Baseball and Total Baseball and as such became the most-widely used factors in the field. They continue in this role thanks to their usage at Baseball-Reference.com.

Broadly speaking, all ratio-based run park factors are calculated in the same manner. The starting point is the ratio of runs at home to runs on the road. There are a number of different possible variations; some methods use runs scored by both teams, while others (including Palmer’s original methodology in The Hidden Game use only the runs scored by one of the teams (home or road)). The opportunity factor can also vary slightly; many people just use games, which is less precise than innings or outs. The variations are mostly technical rather than philosophical in nature, so they rarely get a lot of attention. Park factor calculations are accepted at face value to an extent that other classes of sabermetric tools (like run estimators) are not.

Among park factors in use, the actual computations in Palmer’s are the most unique, so I thought it would be worthwhile to walk through his methodology (as stated in Total Baseball V) and discuss its properties.

In the course of this discussion, I will primarily focus on the actual calculations and not the inputs. Where Palmer has made choices about what inputs to use, how to weight multiple seasons, and the like, I will not dwell, as I’m more interested in the aspects of the approach that can be applied to one’s own choice of inputs. Palmer uses separate park factors for batting and pitching (more on this later); I’ll focus on the batting ones here.

Palmer generally uses three years of data, unweighted, as the basis for the factors. There are some rules about which years to use when teams change parks, but those are not relevant to this discussion. The real meat of the method starts by finding the total runs scored and allowed per game at home and on the road, which I’ll call RPG(H) and RPG(R).

i (initial factor) = RPG(H)/RPG(R)

I will be using the 2010 Colorado Rockies as the example team here, considering just one year of data to keep things simple. Colorado played 81 games both home and away, scoring 479 and allowing 379 runs at home and scoring 291 and allowing 338 on the road. Thus, their RPG(H) = (479 + 379)/81 = 10.593 and RPG(R) = (291 + 338)/81 = 7.765. That makes i = 10.593/7.765 = 1.364 (I am rounding to three places throughout the post, which will cause some rounding discrepancies with the spreadsheet from which I am reporting the results).

The next step is to adjust the initial factor for the number of innings actually played rather than just using games as the denominator. This step can be ignored if you begin with innings or outs as the denominator rather than using games. The Innings Pitched corrector is:

IPC = (18.5 - Home W%)/(18.5 - (1 - Road W%))

Palmer explains that 18.5 is the average number of innings batted per game if the home team always bats in the ninth inning. Teams that win a higher percentage of games at home bat in less innings due to skipping the bottom of the ninth. The IPC seems to assume that in all games won by the home team, they do not bat in the bottom of the ninth.

Colorado was 52-29 (.642) at home and 31-50 on the road (.383), so their IPC is:

IPC = (18.5 - .642)/(18.5 - (1 - .383)) = .999

The initial factor is divided by the IPC to produce what the explanation refers to as Run Factor:

RF = i/IPC

For the Rockies, RF = 1.364/.999 = 1.366

The next step is the Other Parks Corrector (OPC). The OPC “[corrects] for the fact that the other road parks’ total difference from the league average is offset by the park rating of the club that is being rated.” The glossary explanation may be confusing, but the thought process behind it is pretty straightforward--a team’s own park makes up part of the league road average, but none of the team’s own road games are played there. Without accounting for this, all parks would be estimated to be more extreme than they are in reality.

The OPC is figured in the same manner as I do it in my park factors; I borrowed it from Craig Wright’s work without realizing that Palmer had done the same mathematical operation, but its derivation is fairly obvious and the equivalent appears in multiple park factor approaches. Let T equal the number of teams in the league:

OPC = T/(T - 1 + RF)

Basically, OPC assumes a balanced schedule, so each team’s schedule is made up half of games in its own park (hence the use of RF) and half in the other parks, of which there are T - 1. For a sixteen team league (and thus Colorado 2010):

OPC = 16/(16 - 1 + 1.366) = .978

The next step is to multiply RF by OPC, producing scoring factor:

SF = RF*OPC

For Colorado, SF = 1.366*.978 = 1.335

If all you were interested in was an adjustment factor of the park effect’s on scoring, rather than specific adjustments for the team’s batters and pitchers, this would be your stopping point. The scoring factor is the final park factor in that case, and with the exception of the Innings Pitched Corrector, it is equivalent to the approach used by Craig Wright, myself, and many others. (My park factors are then averaged with one to account for the fact that only half of the games for a given team are at home, but that only obscures the fact that the underlying approach is identical, and Palmer accounts for that consideration later in his process).

It is when the other factors are adjusted for that the math gets a little more involved. The first step is to calculate SF1, which is an adjustment to scoring factor:

SF1 = 1 - (SF - 1)/(T - 1)

For the 2010 Rockies:

SF1 = 1 - (1.335 - 1)/(16 - 1) = .978

While I am writing this explanation, I must stress that it is a walkthrough of another person’s method. I cannot fully explain the thought process behind it and justify every step. I decided to include that disclaimer at this point because I don’t understand what the purpose of SF1 is, as it is mathematically equivalent to OPC. Why it needed to be defined again and in a more obtuse way is beyond me.

In any event, the purpose of SF1 is to serve as a road park factor. If a team plays a balanced schedule and we take as a given that the overall league park factor should be equal to one, but we determine that its own park has a PF greater than one (favors hitters), then it must be the case that the road parks they play in have a composite PF less than one. That’s the function of the OPC, and of SF1. For the rest of this post, I will refer to OPC rather than SF1 when it is used in formulas.

Palmer uses separate factors for batters and pitchers to account for the fact that a player does not have to face his teammates. By doing so, Palmer’s park factors make their name something of a misnomer as they adjust for things other than the park. (One could certainly make the case that the park factor name is constantly misapplied in sabermetrics, as we can never truly isolate the effect of the park, and the sample data we have is affected by personnel and other decisions. Palmer takes it a step further, though, by accounting for things that have nothing to do with the park.) The park effect is generally stronger than the effect of not facing one’s own teammates, since a team plays in its park half the time but only misses out on facing its own pitchers 1/T percent of the time assuming a balanced schedule.

One can argue about the advisability of adjusting for the teammate factor at all, and if so it certainly is debatable whether it should be included in the park factor or spun off as a separate adjustment. I would find the later to be a much better choice that would result in a much more meaningful set of factors (both for the park and teammates), but Palmer chose the former.

The separate factors are calculated through an iterative process. One must know the R/G scored and allowed at home and away for the team (I’ll call these RG(H), RG(R), RAG(H), and RAG(R) for R/G at home, R/G on the road, RA/G at home, and RA/G on the road respectively). Additionally, one must know the average RPG for the entire league (which I’ll call 2*N, since I often use N to denote league average runs/game for one team). For the Rockies, we can determine from the data above that RG(H) = 5.913, RG(R) = 3.593, RAG(H) = 4.679, RAG(R) = 4.173, and N = 4.33.

The iterative process is used to calculate a team batter rating (TBR) and a team pitcher rating (TPR, not to be confused with Total Player Rating, another Palmer acronym). These steps are necessary since the strength of each unit cannot be determined wholly independently of the other. Just as the total impact of park goes beyond the effect on home games and into road games (necessitating the OPC, but not being of strong enough magnitude to swamp over the home factor), so the influence of the two units on another are codependent.

The first step of the process assumes that the pitching staff is average (TPR = 1). Then the TBR can be calculated as:

TBR = [RG(R)/OPC + RG(H)/SF]*[1 + (TPR - 1)/(T - 1)]/(2*N)

For Colorado:

TBR = [3.593/.978 + 5.913/1.335]*[1 + (1 - 1)/(16 - 1)]/(2*4.33) = .936

The first bracketed portion of the equation adds the teams’ adjusted (for park) R/G at home and on the road, using the home (SF) or road (OPC) park factor as appropriate. The second set of brackets multiplies the first by the adjustment for not facing the team’s pitching staff. The difference between the team pitching and average (TPR - 1) is divided by (T - 1) since playing a true balanced schedule, the team would only face those pitchers in 1/(T - 1) percent of the games anyway. If the team’s pitchers are above average (TPR < 1), then the correction will increase the estimate of team batting strength.

Then the entire quantity is divided by double the league average of runs scored per game by a single team. This may seem confusing at first glance, but it is only because the first bracketed portion did not weight the home and road R/G by 50% each. The formula could be written as the equivalent:

TBR = [.5*RG(R)/OPC + .5*RG(H)/SF]*[1 + (TPR - 1)/(T - 1)]/N

In this case, it is much easier to see that the first bracket is the average runs scored per game by the team, park-adjusted. Dividing this by the league average results in a very straightforward rating of runs scored relative to the league average. Both TBR and TPR are runs per game relative to the league average, but in both cases they are constructed as team figure/league figure. This means that the higher the TBR, the better, with the opposite being true for TPR.

The pitcher rating can then be estimated using the actual TBR just calculated, rather than assuming that the batters are average:

TPR = [RAG(R)/OPC + RAG(H)/SF]*[1 + (TBR - 1)/(T - 1)]/(2*N)

For the Rockies, this yields:

TPR = [4.173/.978 + 4.679/1.335]*[1 + (.936 - 1)/(16 - 1)]/(2*4.33) = .894

The first estimate is that the Rockies batters, playing in a neutral park and against a truly balanced schedule, would score 93.6% of the league average R/G. Rockie pitchers would be expected to allow 89.4% of the league average R/G. At first when I was performing the sample calculations, I thought I might have made a mistake since the Colorado offense was evaluated as so far below average, but I was forgetting that they would appear quite poor using a one-year factor for Coors Field. My instinct as to what Colorado’s TBR should look like was informed by my knowledge of the five-year PF that I use.

Now the process is repeated for three more iterations, each time using the most recently calculated value for TBR or TPR as appropriate. The second iteration calculations are:

TBR = [3.593/.978 + 5.913/1.335]*[1 + (.894 - 1)/(16 - 1)]/(2*4.33) = .929

TPR = [4.173/.978 + 4.679/1.335]*[1 + (.929 - 1)/(16 - 1)]/(2*4.33) = .893

Repeating the loop two more times should ensure pretty stable values. Theoretically, there’s nothing to stop you from setting up an infinite loop. I won’t insult your intelligence by spelling the second pair of iterations suggested by Palmer, but the final results for Colorado are a TBR of .929 and a TPR of .993.

So far, all we’ve done is figured teammate-corrected runs ratings for team offense and defense. Our estimate of the park factor remains stuck back at scoring factor. All that’s left now is correcting scoring factor for 1) the teammate factor and 2) the fact that a team plays half its games at home and half on the road. This will require two separate formulas--a batter’s PF (BPF) and a pitcher’s PF (PPF), which are twins in the same way that TBR and TPR are:

BPF = (SF + OPC)/[2*(1 + (TPR - 1)/(T - 1)]

PPF = (SF + OPC)/[2* (1 + (TBR - 1)/(T - 1)]

For Colorado:

BPF = (1.335 + .978)/[2*(1 + (.893 - 1)/(16 - 1)] = 1.165

PPF = (1.335 + .978)/[2*(1 + (.929 - 1)/(16 - 1)] = 1.162

The logic behind the two formulas should be fairly obvious at this point. Again, the multiplication by two in the denominator arises because the two factors in the numerator were not weighted at 50% each. You can write the formulas as:

BPF = (.5*SF + .5*OPC)/[1 + (TPR - 1)/(T - 1)]

PPF = (.5*SF + .5*OPC)/[1 + (TBR - 1)/(T - 1)]

Here, you can see more clearly that the numerator averages the home park factor (SF) and the road park factor (OPC). The numerator could be your final park factor if you wanted to account for road park but didn’t care about teammate effects. The denominator is where the teammate effect is taken into account, and in the same manner as in TBR and TPR. If TPR > 1, BPF goes down because the batters did not benefit from facing the team’s poor pitching. If TBR > 1, PPF goes down pitchers benefitted from not facing their batting teammates.

If one does not care to incorporate the teammate effect, then Palmer’s park factors are pretty much the same as any other intelligently designed park factors. This should not come as a surprise, because as with many implementations of classical sabermetrics, Palmer’s influence is towering. The iterative process used to generate the batting and pitching ratings is pretty clever, and incorporates a real albeit small effect that many of us (myself included) often gloss over.