Wednesday, January 20, 2021

Akousmatikoi Win Estimators, pt. 3: Differential-Based Simplifications

Simplifying the Pythagorean estimate by focusing on run differential is not as intuitive as using run ratio, since of course Pythagorean constructs are based on the latter rather than the former. The upfront calculus is messier, the relationships harder to explain – I’ve covered all this before, and so I went back to my previous work rather than go through the hassle of re-deriving it. However, while the calculus is messier, the end result is simpler, and gives you relationships that you might actually choose to use in place of the full Pythagorean treatment if you want something quick and simple to punch into a calculator.

The easiest way I’ve found to demonstrate this approach (which is not to say that a simpler derivation doesn’t exist) is to use the following definitions. To make this easier to follow, I’m going to define R as R/G and RA as RA/G:

RR = R/RA

RD = R - RA

RPG = R + RA

Given these relationships, we can relate run ratio and run differential using RPG:

RR = (RD + RPG)/(RPG – RD)

If you need a proof of that, replace RD and RPG with the equations above and you will see that:

RR = (R – RA + R + RA)/(R + RA – (R – RA)) = (2*R)/(2*RA) = R/RA
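For anyone who would rather check the identity numerically than algebraically, here is a minimal Python sketch. The function name and the sample team (5 runs scored and 4 allowed per game) are my own choices for illustration.

```python
def run_ratio_from_rd_rpg(rd, rpg):
    """Recover run ratio from per-game run differential and runs per game."""
    return (rd + rpg) / (rpg - rd)


# Hypothetical team: 5 runs scored and 4 runs allowed per game
r, ra = 5.0, 4.0
rd = r - ra    # run differential
rpg = r + ra   # runs per game, both teams combined

# Both expressions should print the same run ratio
print(run_ratio_from_rd_rpg(rd, rpg), r / ra)
```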

In the last installment, we differentiated Pythagorean win ratio with respect to run ratio; here, I want to differentiate Pythagorean winning % with respect to run ratio, which will look slightly messier. Starting from the Pythagorean relationship:

W% = RR^x/(RR^x + 1)

we differentiate to get:

dW%/dRR = ((RR^x + 1)*(x*RR^(x – 1)) – RR^x*(x*RR^(x – 1)))/(RR^x + 1)^2

= (x*RR^(x – 1))*((RR^x + 1) – RR^x)/(RR^x + 1)^2

dW%/dRR = x*RR^(x – 1)/(RR^x + 1)^2

That’s well and good, but it doesn’t tell us anything about the relationship between Pythagorean W% and run differential. To bridge that gap, we can differentiate run ratio with respect to run differential and multiply the result by dW%/dRR, which we just derived:

(dW%/dRR)*(dRR/dRD) = dW%/dRD

Since we know that RR = (RD + RPG)/(RPG – RD), we get:

dRR/dRD = ((RPG – RD)*1 – (RD + RPG)*(-1))/(RPG – RD)^2

= 2*RPG/(RPG – RD)^2

If you slogged through any of my previous treatments of this topic, I must apologize – I missed some simplifications of both of these formulas before. The final math worked out the same, but it was needlessly difficult to follow. In any event, we now have:

dW%/dRD = (x*RR^(x – 1)/(RR^x + 1)^2) * (2*RPG/(RPG – RD)^2)

= 2*RPG*x*RR^(x – 1)/((RR^x + 1)^2*(RPG – RD)^2)

This ends up being expressed in terms of marginal wins per marginal run. The classic sabermetric presentation is marginal runs per marginal win (Runs Per Win, a la the rule of thumb that 10 runs = 1 win). So we can take the reciprocal to get this formula for Runs Per Win from Pythagorean:

Pythagorean RPW = (RR^x + 1)^2*(RPG – RD)^2/(2*RPG*x*RR^(x - 1))
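The derivative and its reciprocal can be sketched in Python as follows. The function names are mine, and the finite-difference comparison is an extra sanity check that I am assuming is a fair test: it nudges run differential up and down while holding RPG fixed, which is the same condition the chain-rule derivation relies on.

```python
def pyth_wpct(r, ra, x=2.0):
    """Pythagorean W% from per-game runs scored and allowed."""
    return r**x / (r**x + ra**x)


def dwpct_drd(r, ra, x=2.0):
    """Marginal wins per marginal run: dW%/dRD from the formula above."""
    rr = r / ra
    rd = r - ra
    rpg = r + ra
    return 2 * rpg * x * rr**(x - 1) / ((rr**x + 1)**2 * (rpg - rd)**2)


def pyth_rpw(r, ra, x=2.0):
    """Runs per win: the reciprocal of dW%/dRD."""
    return 1.0 / dwpct_drd(r, ra, x)


def numeric_slope(r, ra, x=2.0, h=1e-6):
    """Central difference in RD with RPG held fixed.

    Raising RD by h at constant RPG moves R up by h/2 and RA down by h/2.
    """
    up = pyth_wpct(r + h / 2, ra - h / 2, x)
    down = pyth_wpct(r - h / 2, ra + h / 2, x)
    return (up - down) / (2 * h)


# The formula and the numerical slope should agree closely
print(dwpct_drd(5.0, 4.0), numeric_slope(5.0, 4.0), pyth_rpw(5.0, 4.0))
```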

Before moving forward, one thing I should note is that this function does not allow us to match the Pythagenpat W% at a given point for a set of inputs. For example, if you plug in 5 runs scored and 4 runs allowed, you will get a dW%/dRD of .1071. You might then reasonably assume that if you take the team’s run differential of 1 times .1071 plus a y-intercept (which by definition would be .5 since Pythagorean will estimate a .500 W% when R = RA), you will get a restatement of the team’s Pythagorean W%. But in fact you will get .6071, while Pythagorean would estimate 5^2/(5^2 + 4^2) = .6098. The differences will be more extreme if you put in more extreme teams.

Alas, I do not have a simple mathematical explanation for why this is the case. However, I will note that we don’t need calculus to calculate the actual Runs Per Win value from Pythagorean for any given set of R, RA, and x that we input. We can simply calculate this by noting that:

W% = RD/RPW + .5

Plugging in Pythagorean relationships and solving for RPW:

R^x/(R^x + RA^x) = RD/RPW + .5

R^x/(R^x + RA^x) - .5 = RD/RPW

RPW = (R – RA)/(R^x/(R^x + RA^x) - .5)

For our 5 R/4 RA team, this results in (5 – 4)/(5^2/(5^2 + 4^2) - .5) = 9.1111 RPW or .1098 wins/run, which of course is the right answer. In terms of simplifying the Pythagorean relationship, though, this is useless – all we’ve done is rearrange terms to calculate runs per win for a given set of inputs. How we could use this to produce a flatter win estimator is to eliminate the use of a team’s R and RA figures and instead replace with a function that only considers the scoring level (i.e. RPG).

This is what the rule of thumb that 10 runs = 1 win does, substituting a general rule for specifics about the team’s actual location on a run/win curve with respect to the marginal value of an additional run scored or allowed. As such, since it’s establishing a rule that will be applied to all teams, it makes sense to center it at the point which will be closest to an average team – at the point where R = RA.

In other words, we will be developing a RPW equation that can be applied generally, but will be defined based on the relationship at the point where R = RA for a given RPG. Using our formula above for RPW based on rearrangement of terms in the Pythagenpat relationship, we can substitute R = RA wherever we see one of those terms and...reduce the equation to 0/0, as the numerator R – RA equals 0 when R = RA, and the denominator R^x/(R^x + RA^x) - .5 = 0 when R = RA.

However, this is where the equation for RPW derived using calculus can step in, and tell us what the theoretical RPW value is at that point. Recall from above that:

RPW = (RR^x + 1)^2*(RPG – RD)^2/(2*RPG*x*RR^(x – 1))

If we assume that R = RA, then RR = 1 and RD = 0, and this simplifies nicely to:

RPW = (1^x + 1)^2*(RPG – 0)^2/(2*RPG*x*1^(x – 1))

= 2^2*RPG^2/(2*RPG*x) = 4*RPG^2/(2*RPG*x) = 2*RPG/x
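One way to convince yourself that 2*RPG/x really is the value the rearranged-terms formula is heading toward is to evaluate that formula for teams that get closer and closer to R = RA. The sketch below does that; the function names and the choice of 8.83 RPG with x = 1.847 as the test point are my own.

```python
def secant_rpw(r, ra, x):
    """RPW from rearranging W% = RD/RPW + .5; undefined when r == ra."""
    wpct = r**x / (r**x + ra**x)
    return (r - ra) / (wpct - 0.5)


def rpw_at_even(rpg, x):
    """Calculus-based RPW at the point where R = RA."""
    return 2 * rpg / x


rpg, x = 8.83, 1.847
for eps in (0.5, 0.05, 0.005):
    # Split RPG into a slightly above-average team: R = RPG/2 + eps
    r, ra = rpg / 2 + eps, rpg / 2 - eps
    print(eps, secant_rpw(r, ra, x), rpw_at_even(rpg, x))
```

As eps shrinks, the first RPW column should close in on the second.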

The first immediate implication is that for our special Pythagorean case where x = 2, RPW = RPG. Since the general case is:

W% = (R/G – RA/G)/RPW + .5

RPW = RPG is equivalent to saying that (after all of the game denominators cancel out):

W% = (R – RA)/(R + RA) + .5

What if x is a constant other than 2, like the value of x = 1.847 that minimizes RMSE for expansion-era major league teams? Then RPW = 2*RPG/1.847 = 1.083*RPG, and we could say that:

W% = (R/G – RA/G)/(1.083*(R/G + RA/G)) + .5

= (1/1.083)*(R/G – RA/G)/(R/G + RA/G) + .5

= .923*(R – RA)/(R + RA) + .5

More generally:

W% = (x/2)*(R – RA)/(R + RA) + .5

This form is one that was proposed by Ben Vollmayr-Lee as .91*(R – RA)/(R + RA) + .5 (I’ve rewritten his formula to match the format I’m using), which would imply a Pythagorean x = 1.82. I would suggest that the Kross equations and the Vollmayr-Lee equation are the ultimate in terms of simplified win estimators from the Akousmatikoi family (again, Kross and Vollmayr-Lee did not start from Pythagorean as we have; by including these estimators in the Akousmatikoi family, I only mean to suggest that they are mathematically related to Pythagorean, not that their creators didn’t independently discover them).
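The general form W% = (x/2)*(R – RA)/(R + RA) + .5 is easy to put side by side with the full Pythagorean estimate. This is only a sketch: the function names and the three sample teams are made up for illustration.

```python
def pythagorean_wpct(r, ra, x=1.847):
    """Full Pythagorean W% with exponent x."""
    return r**x / (r**x + ra**x)


def linear_wpct(r, ra, x=1.847):
    """Flattened estimate: W% = (x/2)*(R - RA)/(R + RA) + .5."""
    return (x / 2) * (r - ra) / (r + ra) + 0.5


# Hypothetical teams: average, good, and extreme
for r, ra in [(4.5, 4.5), (5.0, 4.0), (6.0, 3.5)]:
    print(r, ra, round(pythagorean_wpct(r, ra), 4), round(linear_wpct(r, ra), 4))
```

Passing x = 1.82 to linear_wpct gives the same multiplier of .91 as the Vollmayr-Lee version.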

Remember that for the expansion era, the average RPG is 8.83, which would imply that the long-term RPW value is approximately 1.083*8.83 = 9.56; close enough to ten that you can see why we might have a rule of thumb, although ten runs would imply a 4.5% higher scoring context (10/1.083 = 9.23) than observed in the expansion era.

We could also use a hybrid approach, in which we allow each team’s RPW according to the formula that applies when R = RA to vary based on their RPG, but not on how that RPG breaks down into runs scored and allowed. In order to do this, we’d return to RPW = 2*RPG/x, but instead of setting x equal to a constant, use a custom value for x. Of course, my suggested value would be the Pythagenpat estimate of x, namely:

x = RPG^z, where z = .282 for now (value that minimizes RMSE for the expansion era)

Substituting this equation for x, we find a general case for a variable z that:

RPW = 2*RPG/(RPG^z) = 2*RPG^(1 – z)

Or for the specific case that z = .282:

RPW = 2*RPG^.718
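A short Python sketch of the hybrid approach, assuming z = .282 as above (the function names and the sample RPG levels other than 8.83 are mine):

```python
def pythagenpat_exponent(rpg, z=0.282):
    """Pythagenpat exponent x = RPG^z."""
    return rpg**z


def pythagenpat_rpw(rpg, z=0.282):
    """RPW at R = RA with a Pythagenpat exponent: 2*RPG^(1 - z)."""
    return 2 * rpg ** (1 - z)


# Low, expansion-era average, and high scoring contexts
for rpg in (7.0, 8.83, 10.0):
    print(rpg, round(pythagenpat_exponent(rpg), 3), round(pythagenpat_rpw(rpg), 3))
```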

We could further flatten this equation by approximating it with a linear function. Recall from the last section that we can write a tangent line in the form:

y – y1 = m(x – x1) where x1 and y1 are the x and y values for the point in question, and m is the slope of the curve at x1.

To apply this approach to develop a linear approximation of the above equation, we first need the slope of the RPW function 2*RPG^(1 – z). Differentiating with respect to RPG yields 2*(1 – z)*RPG^(-z).

Let’s center this at the point corresponding to our expansion-era averages, so x = 1.847. (For the eagle-eyed readers or those checking my math, which is always welcome: I’m choosing to use the value that minimizes RMSE to be consistent with earlier applications rather than the value of 1.848 that corresponds to 8.83 RPG using the equation directly.) In this case x1 will be 8.83 RPG, and y1 = 2*8.83^.718 = 9.555. At 8.83 RPG, m will be 2*(1 - .282)*8.83^(-.282) = .777, so we have:

RPW – 9.555 = .777*(RPG – 8.83)

which simplifies to:

RPW = .777*RPG + 2.694
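The tangent-line construction can be sketched in Python so that the slope and intercept fall out for any centering point, not just 8.83 RPG. The function names are hypothetical, and the RPG levels used for the comparison are arbitrary.

```python
def pythagenpat_rpw(rpg, z=0.282):
    """The curve being approximated: RPW = 2*RPG^(1 - z)."""
    return 2 * rpg ** (1 - z)


def tangent_rpw_coeffs(rpg0, z=0.282):
    """Slope and intercept of the tangent line to the RPW curve at rpg0."""
    slope = 2 * (1 - z) * rpg0 ** (-z)                  # derivative at rpg0
    intercept = pythagenpat_rpw(rpg0, z) - slope * rpg0  # from point-slope form
    return slope, intercept


slope, intercept = tangent_rpw_coeffs(8.83)
print(round(slope, 3), round(intercept, 3))

# Compare the line with the curve away from the centering point
for rpg in (7.0, 8.83, 10.0):
    print(rpg, pythagenpat_rpw(rpg), slope * rpg + intercept)
```

The line matches the curve at 8.83 RPG and drifts away from it as RPG moves in either direction.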

We’ve now developed two RPW estimates, using only RPG as an independent variable, one with a y-intercept and one without, by trying to flatten the Pythagorean relationships wherever possible. Which is more accurate? One would assume that it’s the version with y-intercept, but even if it is, how much more accurate for normal teams, and how does this tangent line based approach compare with the best fit for an equation of the form RPW = m*RPG + b? Those are questions we’ll explore in the final installment.

References

Ben Vollmayr-Lee’s article on win estimation formulas:

http://www.eg.bucknell.edu/~bvollmay/baseball/pythagoras.html

Ralph Caola published multiple articles on using differentiation with the Pythagorean formula, as well as an (to the best of my knowledge) unpublished article he shared with me on double the edge.

His articles can be found in the 11/2003, 2/2004, and 5/2004 issues of By the Numbers.

https://sabr.org/research/statistical-analysis-research-committee-newsletters/

Kevin D. Dayaratna and Steven J. Miller explored the relationship that RPW = 2*RPG/x in the 5/2012 issue of BTN. I had known and used that one for a long time, thanks originally to a post by David Glass on rec.sport.baseball. Unfortunately a quick search did not yield a live link to Glass’ post.

Wednesday, January 06, 2021

Akousmatikoi Win Estimators, pt. 2: Ratio-Based Simplifications

We will begin our endeavor to simplify/“flatten” the Pythagenpat exponent by looking at approaches that maintain the use of run ratio as the chief independent variable in the W% estimate. Before jumping into that, I should note that we could think of the first flattening as moving from a variable exponent like Pythagenport/pat to a fixed exponent. However, since the latter came first historically, and is easier to explain conceptually, I didn’t approach it in that manner.

We could also make flattening the Pythagenpat exponent itself the first step. My definition of “flatten” for the sake of this discussion is to replace exponents with multiplication where possible. We could start by trying to convert x = RPG^.282 into a linear formula. I’ve skipped this step because we would still be left with exponents when we go to calculate the winning percentage. While simplifying the equations will generally cost us some theoretical and a tiny bit of empirical accuracy, it will gain us ease of calculation. Replacing RPG^.282 with a linear equation wouldn’t really make the calculation any easier, but more importantly I don’t think it would result in an interesting alternative methodology to estimate W%. It would just result in a very slightly easier to calculate, less accurate Pythagenpat equation.

I previously wrote the general Pythagorean relationship as:

W% = R^x/(R^x + RA^x)

but note that we could equivalently define win ratio (W/L = WR) as:

WR = RR^x where RR = run ratio = R/RA

I will alternate between these two ways of writing the equation depending on whichever is most convenient for what we’re trying to do. In this case, I want to see what happens if we get rid of the exponent. The approach I will take is to replace the current function with a simplified function that produces the same result for a particular point. Of course we cannot replace the function with another that will produce the same results at all points, or even expect to find one that would produce the same results at multiple points. But we will be able to find a function that produces the same result at a given point.

Mathematically, this will be the tangent line to the curve at that point. At that point, the tangent line touches the curve and has the same slope as the curve. We will determine the slope by differentiating the function, and we will then determine the tangent line using the point-slope equation for the line as a starting point (to me, this is the most intuitive way to write the equation of a line, and if necessary we can simplify later). The point-slope equation of a line is:

y – y1 = m(x – x1)

where x1 and y1 are the x and y values for the point in question, and m is the slope of the curve at x1.

I’m going to switch to referring to the Pythagorean exponent as “a”, so that it doesn’t get confused with x, our independent variable (which is run ratio). So if we want the tangent line for the equation WR = RR^a, we first differentiate with respect to run ratio to get:

dWR/dRR = a*RR^(a – 1)

Now we just need to determine x1 and y1. Since we are going to be applying simplified win estimation formulas across the entire spectrum of possible team performance, it makes the most sense to look at a team with R = RA, which we expect to have a .500 W%. Picking the average will likely result in the most accurate simplified equation over the entire spectrum of teams.

Of course, by simplifying the equation, we will lose accuracy (at least when the result of our simplified equation is compared to the “parent” equation – we hope in this case that the Pythagorean form is more accurate or else the entire premise of Akousmatikoi win estimators is moot). However, the simplified equation will match the parent equation precisely at the chosen point, and will produce very similar results near the chosen point, so picking a point in the center of the distribution should maximize accuracy.

So, if R = RA, then RR equals one, and so our slope is simply equal to a, which is the Pythagorean exponent. Our x value is RR, which is 1, and our y is the WR corresponding to a RR of 1, which is 1 for any value of a as WR = RR^a. So in point-slope form:

y – 1 = a*(x – 1)

which can simplify to

y – 1 = a*x – a

y = a*x – a + 1

Remembering what y and x represent in this case:

WR = a*RR – a + 1

For a fixed Pythagorean exponent a = 2:

WR = 2*RR – 2 + 1 = 2*RR - 1

This relationship suggests that if a team scores 10% more runs than it allows, it should win 20% more games than it loses. In the 1984 Baseball Abstract, Bill James wrote:

Another method that I have never tested but which I suspect would work as well as the others would be just to “double the edge”; that is, if a team scores 10% more runs than their opponents, they should win 20% more games than their opponents. If they score 1% more runs, they should win 2% more games. That method would probably work as well or better than the Pythagorean approach.

To my knowledge that’s the extent of James’ writings on this subject, so I can’t say whether he either explicitly or implicitly inferred “double the edge” from the Pythagorean formula, or whether he came across it some other way. Either way, it can be directly related back to his own Pythagorean method.
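To see how closely the tangent line WR = a*RR – a + 1 tracks the Pythagorean win ratio near RR = 1, here is a minimal Python sketch. The function names and the sample run ratios are my own and are only meant as an illustration.

```python
def win_ratio_pyth(rr, a=2.0):
    """Pythagorean win ratio: WR = RR^a."""
    return rr**a


def win_ratio_linear(rr, a=2.0):
    """Tangent line to WR = RR^a at RR = 1: WR = a*RR - a + 1."""
    return a * rr - a + 1


# "Double the edge" is the a = 2 case: 10% more runs -> 20% more wins than losses
for rr in (1.0, 1.01, 1.1, 1.3):
    print(rr, win_ratio_pyth(rr), win_ratio_linear(rr))
```

The two columns agree at RR = 1 and separate gradually as the run ratio grows.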

If WR = a*RR – a + 1, and we already know that by definition W% = WR/(WR + 1), then we can convert this into a W% estimate as:

W% = (a*RR – a + 1)/(a*RR – a + 1 + 1) = (a*RR – a + 1)/(a*RR – a + 2)

For the special case of a = 2, this becomes:

W% = (2*RR – 2 + 1)/(2*RR – 2 + 2) = (2*RR – 1)/(2*RR) = 1 – 1/(2*RR) = 1 – 1/(2*R/RA)

= 1 - RA/(2*R)

This special case was noted by Bill Kross, and got a brief callout in The Hidden Game of Baseball. Kross also noticed that this method would not produce the same result for teams that had inverse runs and runs allowed. A team that scores 5 and allows 4 runs would have an estimated W% of 1 - 4/(2*5) = .600, but a team that scores 4 and allows 5 would have an estimated W% of 1 – 5/(2*4) = .375.

So Kross proposed that for the case in which runs scored < runs allowed, the W% would be estimated as R/(2*RA), which would produce 4/(2*5) = .400 for the team scoring 4/allowing 5. Not only is it satisfying to get a consistent result for the two sides of the same coin, this modification significantly improves the accuracy when empirically comparing estimated to actual W%s.
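The two Kross equations can be sketched as a single Python function with a branch on which side of .500 the team falls. The function name is mine; sending R = RA through the first branch is an assumption of convenience, since both branches give .500 there.

```python
def kross_wpct(r, ra):
    """Kross W% estimate: 1 - RA/(2*R) when R >= RA, else R/(2*RA)."""
    if r >= ra:
        return 1 - ra / (2 * r)
    return r / (2 * ra)


# The mirrored teams from the text now get complementary estimates
print(kross_wpct(5, 4), kross_wpct(4, 5))
```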

Expressing this inversion in terms of the general case above, in a case where R < RA, the estimated WR would be:

WR = 1/(a*1/RR – a + 1) = 1/(a/RR – a + 1)

and the W% would be:

W% = (1/(a/RR – a + 1))/(1/(a/RR – a + 1) + 1)

There are some ways to make that look nicer, but I don’t think any of them are sufficiently nice to bother with here. For the specific case when a = 2, Ralph Caola has suggested this formula as a clean way to boil the Kross equations down to one line:

W% = (R - RA)/(R + RA + ABS(R - RA)) + .5
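As a check that the general case with the inversion for R < RA collapses to Caola’s one-liner when a = 2, here is a small Python sketch. Both function names are hypothetical, and the sample teams are arbitrary.

```python
def tangent_wpct(r, ra, a=2.0):
    """W% from the tangent-line win ratio, inverted when R < RA."""
    rr = r / ra
    if rr >= 1:
        wr = a * rr - a + 1
    else:
        wr = 1 / (a / rr - a + 1)
    return wr / (wr + 1)


def caola_wpct(r, ra):
    """Caola's single-line form of the Kross equations (the a = 2 case)."""
    return (r - ra) / (r + ra + abs(r - ra)) + 0.5


# The two estimates should match on both sides of .500 when a = 2
for r, ra in [(5.0, 4.0), (4.0, 5.0), (4.5, 4.5), (6.0, 3.0)]:
    print(r, ra, tangent_wpct(r, ra), caola_wpct(r, ra))
```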

You might be reading this and objecting “I thought you were going to simplify the Pythagorean relationship, but nothing about the equation with all of those reciprocals above looks simpler”. That is true – other than the special case when a = 2 and the Kross equations apply, this is not an easier way to calculate an estimated winning percentage provided you have a modern calculator or computer. However, it is “simpler” mathematically in the sense that we have eliminated exponents. Of course, in so doing we have lost some accuracy, particularly for extreme cases. Next time, instead of starting with run ratio, we’ll start with run differential and see what shakes out of Pythagorean and how it compares to methods that have been developed independently of Pythagorean.