Wednesday, September 26, 2018

Enby Distribution, pt. 8: Cigol at the Extremes--Pythagorean Exponent

Among the possible choices, the Pythagorean family of W% estimators is by far the dominant species in the win estimator genus. While I’m sure that anyone reading this is aware, just to be sure, the Pythagorean family takes the form:

W% = R^x/(R^x + RA^x)

While as a co-purveyor of one of the variants I am not exactly unbiased in this matter, here are a few reasons as to why this family dominates sabermetric usage:

1. The Bill James effect--The number one reason why Pythagorean estimators are widely used is because of Bill James. Had James used a RPW method as Pete Palmer did, I would still be writing soapbox-y blog posts about why some other form made more sense (as I still do from time to time on the matter on run estimation, in which Palmer’s form has finally won the day over James). Had James not used Pythagorean, it is possible that whatever non-linear win estimator was widely used in sabermetrics (and one doubtlessly would have been developed) would take on a different form than Pythagorean.

2. Naturally bounded at zero and one--Winning percentage is by its nature bounded at zero and one. The Pythagorean form inherently captures this reality. Had the trail in this area been blazed by statisticians rather than James, we might have gotten a logit or probit regression equation that did the same, just to name a couple of common functions that also are bounded by zero and one. In order to have a theoretically justifiable formula, the bounds must be observed, and Pythagorean is a fairly straight forward way to do it.

3. Non-linearity reflects reality--It can be demonstrated even with “extreme” but actual major league teams that the run-to-wins relationship is non-linear. Pythagorean may not capture this perfectly, but it seems right to account for it in some way. This is one reason why people still cling to Runs Created after it was shown to be inaccurate (particularly before Base Runs, which fills the void, had been popularized)--people inherently realize that run scoring is a non-linear process, and are more comfortable with a method that recognizes that, even if it captures the effect in a very flawed manner.

James’ original versions used fixed exponents (x = 2 , refined to x = 1.83) but the breakthrough research on factoring scoring context into the equation was performed by Clay Davenport and Keith Woolner at Baseball Prospectus, who found that an exponent x = 1.5*log(RPG) + .45 worked well when RPG was greater than 4. This variant is known as Pythagenport. A couple years later, David Smyth realized that the minimum possible RPG was one, since a game will continue indefinitely until one side wins (which requires scoring one run), and that if a team had a RPG of one, their exponent would have to be equal to one. Based on this insight, Smyth and I were able to both independently find a form that returned an estimate of x = 1 at RPG = 1 and also estimates similar to Pythagenport for normal teams. This form has become known as Pythagenpat.

Let’s begin by trying to find an equation to estimate the exponent based on RPG from our Cigol estimates. In order to do this, we first need to be able to solve for the exponent x from W% and Run Ratio:

W/L = (R/RA)^x is a restatement of the generic Pythagorean equation W% = R^x/(R^x + RA^x)

thus x = log(W/L)/log(R/RA)

which when working with W% can be expressed equivalently as:

x = log(W%/(1 - W%))/log(R/RA)

We can now attempt to fit a regression line to predict x from RPG based on the Cigol estimates. For illustration, I’ll start with the full data discussed last time rather than what I’ll call the limited set (the limited set is limited to normal-ish major league teams--W%s between .3 and .7 with R/G and RA/G between 3 and 7):



This graph is not very helpful, but one can see the general shape, which can be approximated by a logarithmic curve as noted by Davenport and Woolner. I’ve gone ahead and included the logarithmic regression line per Excel, but you’ll note that it uses natural log rather than base-10 log as in Pythagenport. Running a regression on log(RPG) results in this equation:

x = 1.346*log(RPG) + .596

That is a relatively decent match for Pythagenport--the two equations produce essentially the same exponent at normal RPGs (for example, for 9 RPG the Pythagenport exponent is 1.881 and the regression exponent is 1.880). At lower RPGs, the higher intercept in the regression equation allows the estimate to be closer to one at the known point, but it still falls well short of matching the known point value of one.

Just to be complete, we can also look at how this relationship plays out in the limited set:



Here, the base-10 equation is:

x = 1.324*log(RPG) + .580

One thing that is interesting to note is that in the last installment, when we focused on estimating Runs Per Win, the regression equations were quite different depending on which dataset was being used. Here, whether looking at the full scope of teams or the limited set, the regression equations are quite close. This implies that the manner in which we are expressing W% (Pythagorean) is closer to capturing the real relationship between scoring context and W% than is the RPW model. If there existed a perfect model, it would have the same equation regardless of which data was used to calibrate it. While Pythagorean is not a perfect model, it exhibits a consistency that the run differential-based model does not.

As the graphs illustrate, the relationship between RPG and x appears to follow a logarithmic curve, and so it is quite understandable that Davenport and Woolner chose this form. However, Smyth and I both found that a power regression also provided a nice fit for the curve (for example, this is the result for the all teams Cigol estimate):



The power estimate does an excellent job of matching the Cigol-implied exponent at very low levels of RPG. Mathematically, it works well for this task since one raised to any power is equal to one. Since the logarithm of one is zero, the logarithmic form would only be able to match reality at 1 RPG by setting the intercept equal to one, which would distort results at higher RPG values.

As RPG grows large, though, the power model begins overestimating the exponent, while the logarithmic model provides a tighter fit. From a practical standpoint, performance at low levels of RPG is much more important than performance at high levels of RPG, since extremely low RPGs are much more likely to exist in the majors. As a stickler for theoretical accuracy, though, it is a bit troubling to see that the power regression and Cigol are not a great match at the right tail.

If we restrict the sample to the limited set, we find:



Here the power model also provides a decent fit, although it appears to be overfitted to moderate RPG levels more so than the version based on the full dataset.

It should be noted that the regression includes a multiplicative coefficient (.979 for the full dataset) which serves to dampen the effect of the exponent. However, any multiplier other than one will result in a non-one result at one RPG, which means that Smyth’s fundamental insight that led to Pythagenpat is lost. I believe that when I originally came up with Pythagenpat, I simply ignored the multiplicative coefficient from the regression and made no offsetting adjustment.

While neither approach is precise mathematically, another crude option is to modify the exponent to force the x estimate to match the with-coefficient equation at a certain RPG. At the normal 9 RPG level, the full dataset equation above suggests a Pythagorean exponent of 1.863. With a multiplier of 1, you would need the following equation to match that:

x = RPG^.2831

Such a result fits comfortably within our expectation for Pythagenpat.