Wednesday, December 16, 2020

Akousmatikoi Win Estimators, pt. 1: Pythagorean

This series will be a brief review of the Pythagorean methodology for estimating team winning percentage from runs scored and runs allowed, and will examine a number of alternative winning percentage estimators that can be derived from the standard Pythagorean approach. I call it a “review” because I will not be presenting any new methods – in fact, not only was everything I plan to cover discovered and published by other sabermetricians, but it is all material that I have already written about in one form or another. When recently posting old articles from my Tripod site, I saw how poorly organized the section on win estimators was, and decided that I should try to write a cleaner version that focuses on the relationship between the Pythagorean approach and other mathematical forms for win estimators. This series will start from the assumption that Pythagorean is a useful model; I don’t think this is a controversial claim but a full treatment would need to establish that before jumping into mathematical offshoots.

By christening his win estimator the “Pythagorean Theorem” because the three squared terms in the formula reminded him of the three squared terms that Pythagoras discovered relate the sides of right triangles, Bill James made it irresistible for future writers to double down with even more ridiculous names. I am sure any students of Greek philosophy are cursing me, but I am calling this the “Akousmatikoi” family of win estimators because Wikipedia informs me that Akousmatikoi was a philosophical school that was a branch of the larger school of Pythagoreanism based on the teachings of Pythagoras. A rival branch, the Mathematikoi school, was more focused on the intellectual and mathematical aspects of Pythagorean thought, which would make it a better name for my purposes, but even I think that sounds too ridiculous. I’ve also jumbled the analogy, as James’ Pythagorean theorem is the starting point for the Akousmatikoi family of estimators, but Pythagoras begat this school of philosophy and not the other way around. Of course, James’ Pythagorean theorem really has nothing to do with Pythagoras to begin with, so don’t think too hard about this.

Before I get started, I want to make certain that I am very clear that I’m introducing nothing new and that while I will derive a number of methods from Pythagorean, the people who originally discovered and published these methods used their own thought processes and ingenuity to do so. They did not simply derive them from Pythagorean. I will try to namecheck them throughout the series, but will also do it here in case I slip up: among the sabermetricians who independently developed the methods that I will treat as Pythagorean offshoots are Bill Kross, Ralph Caola, and Ben Vollmayr-Lee.

I also want to briefly address the win estimators that are in common use that are not part of what I am calling the Akousmatikoi family. The chief one that I use is Cigol, which is my implementation of a methodology that starts with an assumed run distribution per game and calculates W% from there (I say “calculates” rather than “estimates” because given the assumptions about per game and per inning run distribution functions, it is a logical mathematical derivation, not an estimate. Of course, the assumptions are just that). Cigol is very consistent with the results of Pythagenpat for teams across a wide range of scoring environments, but is its own animal. There are also approaches based on regression that offer non-Akousmatikoi paths to win estimates. If you regress on run differential or run ratio, your results will look similar to Akousmatikoi, but if you take the path of Arnold Soolman’s pioneering work and regress runs and runs allowed separately, or you use logistic regression or another non-linear methodology, your results won’t be as easily relatable to the Akousmatikoi methods.

It all starts with Pythagorean, which Bill James originally formulated as:

W% = R^2/(R^2 + RA^2)

The presence of three squared terms reminded James of the real Pythagorean theorem for the lengths of the sides of a right triangle (A^2 = B^2 + C^2) and gave us the charmingly wacky name for this method of win estimation. James would later complicate matters by noting that a lower exponent resulted in a slight increase in accuracy:

W% = R^1.83/(R^1.83 + RA^1.83)

Later research by Clay Davenport and Keith Woolner would demonstrate that a custom exponent, varying by run environment, would result in better accuracy in extreme situations. Pete Palmer had long before demonstrated that his linear methods increased in accuracy when considering run environment; “Pythagenport” brought this insight to Pythagorean, which we’ll now more generally express as:

W% = R^x/(R^x + RA^x)

where Pythagenport estimates x = 1.5*log(RPG) + .45, with log taken base 10 and RPG = (R + RA)/G.
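The three versions of the formula above can be sketched in a few lines of Python. This is only an illustration: the function names and the 800 R / 700 RA / 162 G team are hypothetical, and the logarithm is read as base 10, the usual convention for Pythagenport.

```python
import math

def pythagorean_wpct(runs, runs_allowed, exponent=2.0):
    """W% = R^x / (R^x + RA^x) for a given exponent x."""
    rx = runs ** exponent
    rax = runs_allowed ** exponent
    return rx / (rx + rax)

def pythagenport_exponent(runs, runs_allowed, games):
    """x = 1.5*log10(RPG) + .45, where RPG = (R + RA)/G (base-10 log assumed)."""
    rpg = (runs + runs_allowed) / games
    return 1.5 * math.log10(rpg) + 0.45

# Hypothetical team, not taken from any real season.
r, ra, g = 800, 700, 162
x = pythagenport_exponent(r, ra, g)

print("Pythagenport exponent:", round(x, 3))
print("W% with x = 2:        ", round(pythagorean_wpct(r, ra, 2.0), 4))
print("W% with x = 1.83:     ", round(pythagorean_wpct(r, ra, 1.83), 4))
print("W% with Pythagenport: ", round(pythagorean_wpct(r, ra, x), 4))
```

Whatever the exponent, a team with R = RA comes out at exactly .500; the exponent only controls how quickly W% moves away from .500 as the run ratio moves away from 1.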

Davenport and Woolner stated that the accuracy of Pythagenport was untested for RPG less than 4. A couple years later, David Smyth had the insight that 1 RPG was a situation that could only occur if the score of each game was 1-0, and that such a team’s W% would by definition be equal to R/(R + RA). This implies that the Pythagorean exponent must be 1 when RPG = 1. Based on this insight, Smyth and I independently developed a modified exponent which was constructed as:

x = RPG^z

where z is a constant generally in the range of .27 - .29 (I originally published as .29 and have tended to use this value out of habit, although if you forced me to pick one value and stick to it I’d probably choose .28).

This approach produced very similar results to Pythagenport for the RPG ranges tested by Davenport and Woolner, and returned the correct result for the known case at RPG = 1. It has come to be called “Pythagenpat”.
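A rough sketch of the Pythagenpat exponent alongside Pythagenport, to make the RPG = 1 case concrete. The function names are hypothetical, z = .28 is just one value from the range quoted above, and the Pythagenport logarithm is assumed to be base 10.

```python
import math

def pythagenpat_exponent(rpg, z=0.28):
    """x = RPG^z; equals 1 when RPG = 1 for any z."""
    return rpg ** z

def pythagenport_exponent(rpg):
    """x = 1.5*log10(RPG) + .45 (base-10 log assumed)."""
    return 1.5 * math.log10(rpg) + 0.45

def pythagenpat_wpct(runs, runs_allowed, games, z=0.28):
    """Pythagorean W% using the Pythagenpat exponent."""
    x = pythagenpat_exponent((runs + runs_allowed) / games, z)
    return runs ** x / (runs ** x + runs_allowed ** x)

# Known case: every game ends 1-0, so RPG = 1 and W% must equal R/(R + RA).
# A hypothetical 100-62 team in that environment scores 100 and allows 62.
print(pythagenpat_wpct(100, 62, 162), 100 / 162)

# Side-by-side exponents over a range of run environments.
for rpg in (1, 4, 6, 8, 10, 12):
    print(rpg, round(pythagenpat_exponent(rpg), 3), round(pythagenport_exponent(rpg), 3))
```

At RPG = 1 the Pythagenport formula returns .45 rather than 1, which is the gap Smyth's observation exposes.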

Using Cigol, I tried to develop a refined formula for the Pythagorean exponent using data for truly extreme teams. I loosened the restriction requiring x = 1 when RPG = 1 to be able to consider a wider range of models, but I wasn’t able to come up with a version that produced superior accuracy to the standard Pythagenpat construction on a large dataset of actual major league team-seasons. My favorites of the versions I came up with are below; I won’t dwell on them any longer but will revisit them briefly at the end of the series. The first is a Pythagenpat exponent that produces a Pythagorean exponent of 1 at 1 RPG; the second is a Pythagorean exponent that does not adhere to that restriction.

z = .27348 + .00025*RPG + .00020*(R - RA)^2

x = 1.03841*RPG^.265 + .00114*RD^2

There are several properties of a Pythagorean construct that make it better suited as a starting point (standing in for the “true” W% function, if there could ever be such a thing) than some of the other methods we’ll look at. I have previously proposed a list of three ideal properties of a W% estimator:

1. The estimate should fall in the range [0,1]
2. The formula should recognize that the marginal value of runs is variable.
3. The formula should recognize that as more runs are scored, the number of marginal runs needed to earn a win increases.
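These three properties can be checked numerically for Pythagenpat. The sketch below works in per-game terms and estimates the marginal runs per win as the reciprocal of the slope of W% with respect to runs scored per game, using a simple finite difference; the function names, the z = .28 default, and the sample run levels are all illustrative assumptions.

```python
def pythagenpat_wpct(r_per_game, ra_per_game, z=0.28):
    """Pythagenpat W% from per-game runs scored and allowed."""
    x = (r_per_game + ra_per_game) ** z
    return r_per_game ** x / (r_per_game ** x + ra_per_game ** x)

def runs_per_win(r_per_game, ra_per_game, step=1e-6):
    """Marginal runs per win: 1 / (dW%/dR), via a forward finite difference."""
    base = pythagenpat_wpct(r_per_game, ra_per_game)
    bumped = pythagenpat_wpct(r_per_game + step, ra_per_game)
    return step / (bumped - base)

# Property 1: the estimate stays in [0, 1] even for a lopsided team.
print(pythagenpat_wpct(8.0, 2.0))

# Property 2: the marginal value of a run differs between an average team
# and a strong team in the same 9 RPG environment.
print(runs_per_win(4.5, 4.5), runs_per_win(5.5, 3.5))

# Property 3: for .500 teams, runs per win rises with the run environment.
for r in (3.0, 4.0, 5.0, 6.0):
    print(2 * r, round(runs_per_win(r, r), 2))
```

For a team with R = RA, the slope works out to x/(2*RPG), so runs per win is 2*RPG/x, or 2*RPG^(1 - z) under Pythagenpat.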

As we move through this series, we will make changes to simplify the Pythagenpat function in some ways; in my notes I called it “flattening”, but that’s not a technical term. Basically, where we see exponents, we will try to convert them into multiplication, or we will try to use run differential in place of run ratio. As we “flatten” the functions out, we will progressively lose some of these ideal properties, with the (usual) benefit of having simpler functions.

Throughout this series I will make sporadic use of the team seasonal data for the expansion era (1961 – 2019), so at this point I want to use this dataset to define the Pythagorean constants that we’ll use going forward. Rather than using any formulaic approach, I am going to fix x and z for this period by minimizing the RMSE of the W% estimates for the teams in the dataset. I will also use the fixed Pythagorean exponent of 2 throughout the series as it is easy to calculate, reasonably accurate, widely used, and mathematically will produce some pleasing results for the other Akousmatikoi estimators.

Using this data, the average RPG is 8.83, the value for x that minimizes RMSE is 1.847, and the z value that minimizes RMSE is .282. Note that if we used the average RPG to estimate the average Pythagorean exponent, we’d get 1.848 (8.83^.282), which doesn’t prove anything but at least it’s not way off.
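For readers who want to repeat this kind of fit on their own data, here is a minimal sketch of choosing a fixed exponent by minimizing RMSE with a plain grid search. It is not the procedure actually used for the figures above, which isn't specified beyond "minimizing the RMSE". The team list is synthetic, with W% generated from an exponent of 1.85 so that the search has a known answer, and all of the names are hypothetical.

```python
import math

def pythagorean_wpct(r, ra, x):
    """W% = R^x / (R^x + RA^x)."""
    return r ** x / (r ** x + ra ** x)

def rmse(teams, x):
    """Root mean squared error of the W% estimate; teams holds (R, RA, actual W%)."""
    squared = [(pythagorean_wpct(r, ra, x) - w) ** 2 for r, ra, w in teams]
    return math.sqrt(sum(squared) / len(squared))

def best_exponent(teams, lo=1.5, hi=2.5, step=0.001):
    """Grid search for the exponent with the lowest RMSE on the given teams."""
    n = int(round((hi - lo) / step))
    candidates = [lo + i * step for i in range(n + 1)]
    return min(candidates, key=lambda x: rmse(teams, x))

# Synthetic placeholder data, NOT real team-seasons: W% is generated from
# x = 1.85 exactly, so the search should hand that value back.
run_totals = [(800, 700), (650, 720), (900, 600), (700, 705)]
synthetic = [(r, ra, pythagorean_wpct(r, ra, 1.85)) for r, ra in run_totals]
print(round(best_exponent(synthetic), 3))

# The sanity check from the text: average RPG raised to the fitted z.
print(round(8.83 ** 0.282, 3))
```

Fitting z instead of x works the same way: replace the fixed exponent with RPG^z computed per team and search over z.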
