I had intended the last installment to be the end of this series, but Tom Tango left a comment on pt. 3 that led me down a rabbit hole. It’s of the frustrating variety, as I can’t figure out how to dig back to the surface and exploring it hasn’t led me to learn anything useful or interesting about baseball. Nevertheless, I find it interesting as a purely mathematical exercise and worth a brief post.

Tango pointed out that some time ago he had proposed the simple formula:

RPW = .75*RPG + 3

(Tango’s version was originally expressed as 1.5*RPG + 3 because he was defining RPG as the average for one team; I’ll keep with my definition here for consistency with the rest of the series.)

I was aware of
this formula and have mentioned it on this blog before, but it slipped my mind
when writing these posts. You may recall from pt.3 that I offered the formula:

RPW = .777*RPG + 2.694

A brief reminder of how this was derived: I started by differentiating the Pythagenpat formula for a fixed z value of .282 with respect to run differential, and then plugged in the appropriate values for a .500 team to get RPW = 2*RPG^(1 - .282). Then I differentiated this formula with respect to RPG, and found the y = mx + b formula that would follow if you assumed a .500 team with the average RPG and RPW of the 1961-2019 major leagues.
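The linearization step can be sketched in a few lines of Python. This is a minimal sketch, assuming the y = mx + b formula is the tangent line to RPW = 2*RPG^(1 - z) at a reference RPG. The function name `tangent_line` and the 8.83 RPG reference point are assumptions for illustration; 8.83 is back-solved from the .777 slope, not taken from the 1961-2019 data.

```python
def tangent_line(z, rpg_ref):
    """Slope and intercept of the tangent to RPW = 2*RPG**(1 - z) at rpg_ref."""
    slope = 2 * (1 - z) * rpg_ref ** (-z)   # d(RPW)/d(RPG) evaluated at rpg_ref
    rpw_ref = 2 * rpg_ref ** (1 - z)        # RPW of a .500 team at rpg_ref
    intercept = rpw_ref - slope * rpg_ref   # line passes through (rpg_ref, rpw_ref)
    return slope, intercept


# Hypothetical reference point: back-solved from the slope, not the measured average.
slope, intercept = tangent_line(z=0.282, rpg_ref=8.83)
print(f"RPW = {slope:.3f}*RPG + {intercept:.3f}")
```

With these inputs the slope and intercept should land within rounding distance of .777 and 2.694.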

Of course, these formulas both take the form y = mx + b, where y is the estimated RPW and x is the team’s RPG. My formula has a higher slope, but a lower intercept. For a team at 9 RPG with a run differential of one per game, mine would estimate 97.72 wins over 162 games and Tango’s 97.62. This doesn’t seem like a lot, and in the grand scheme of things it isn’t, but if this kind of difference didn’t interest me then this blog wouldn’t exist.
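The arithmetic behind that comparison is short enough to check directly. A minimal sketch, assuming the usual conversion W% = .500 + (run differential per game)/RPW and a 162-game schedule; the function name `wins_from_rpw` is made up for illustration.

```python
GAMES = 162


def wins_from_rpw(slope, intercept, rpg, run_diff_per_game, games=GAMES):
    """Estimated wins from W% = .500 + (run differential per game) / RPW."""
    rpw = slope * rpg + intercept
    return games * (0.5 + run_diff_per_game / rpw)


mine = wins_from_rpw(0.777, 2.694, rpg=9, run_diff_per_game=1)
tango = wins_from_rpw(0.75, 3.0, rpg=9, run_diff_per_game=1)
print(f"{mine:.2f} vs {tango:.2f}")  # 97.72 vs 97.62
```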

Using the 1961-2019 data, and scaling the RMSE to 162 games, Tango’s formula has a RMSE of 4.0348, and mine 4.0370. Pythagenpat itself (z = .282) checks in at 4.0345, which is interesting: my RPW formula performs worse than Tango’s, but is derived directly from Pythagenpat, which performs better. Also interesting is that with real major league teams, Tango’s formula is about as accurate as you can get despite being very simple (relative to full-blown Pythagenpat) and having rounded coefficients.
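For anyone who wants to reproduce this kind of comparison, here is a rough sketch of the RMSE calculation. It assumes that "scaling to 162 games" means multiplying each team's W% error by 162 before squaring, and that runs are expressed per game. The function name `rmse_per_162` and the three team lines are hypothetical, not drawn from the 1961-2019 sample.

```python
from math import sqrt


def rmse_per_162(teams, slope, intercept):
    """RMSE of linear-RPW win estimates, scaled to a 162-game season.

    teams: iterable of (runs_per_game, runs_allowed_per_game, actual_w_pct).
    """
    sq_errors = []
    for r, ra, w_pct in teams:
        rpw = slope * (r + ra) + intercept      # RPG counts both teams' runs
        est = 0.5 + (r - ra) / rpw              # estimated W%
        sq_errors.append((162 * (est - w_pct)) ** 2)
    return sqrt(sum(sq_errors) / len(sq_errors))


# Hypothetical team lines for illustration only.
teams = [(5.0, 4.0, 0.600), (4.2, 4.8, 0.450), (4.5, 4.5, 0.510)]
print(rmse_per_162(teams, 0.75, 3.0))      # Tango's coefficients
print(rmse_per_162(teams, 0.777, 2.694))   # the pt. 3 coefficients
```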

Note, I’m emphasizing
RMSE with real teams in this discussion because if you want theoretical
accuracy over a wide range of possible team R/RA combinations, you’d just use
Pythagenpat and be done with it. If you’re using a simplification that isn’t as
accurate as an equally simple formula for the application you’ll most use it
for, what’s the point?

My first thought
as to why Tango’s formula had a lower RMSE than mine was that I had
over-flattened the whole thing and was thus missing something. This series
starts from the premise that Pythagenpat is the right model for win estimation,
and then simplifies from there, often centering at the point of a team that
scores and allows the same number of runs, in an average scoring context. But
the teams in the sample data, while by definition centered there, vary in both
axes (R/RA and RPG). Perhaps the linear approximation to the Pythagorean RPW
for a .500 team misses some subtle change in the slope or intercept caused by
this variation, and you could do better by running a regression on all the
individual datapoints rather than using the single point estimate to derive the
formula.

So I calculated the actual Pythagenpat RPW for all teams (i.e. the value for RPW which when applied will estimate that the team’s W% will be equal to its Pythagenpat W%), which from pt. 3 is:

RPW = (R - RA)/(R^x/(R^x + RA^x) - .5)

where x is the Pythagenpat exponent corresponding to each team’s RPG (x = RPG^z).

This is
undefined when R = RA, but also from pt. 3, we can fill this gap with the
calculus-derived formula for a team with R = RA:

RPW = 2*RPG/x
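Both cases can be folded into one function. A minimal sketch, assuming R and RA are per-game rates (so that RPG = R + RA) and z = .282; the name `pythagenpat_rpw` is an invention for illustration.

```python
from math import isclose


def pythagenpat_rpw(r, ra, z=0.282):
    """RPW implied by Pythagenpat; r and ra are assumed to be per-game rates."""
    rpg = r + ra
    x = rpg ** z                        # Pythagenpat exponent for this RPG
    if isclose(r, ra):
        return 2 * rpg / x              # limiting value when R = RA
    w_pct = r ** x / (r ** x + ra ** x)
    return (r - ra) / (w_pct - 0.5)


print(pythagenpat_rpw(4.5, 4.5))        # .500 team at 9 RPG
print(pythagenpat_rpw(5.5, 3.5))        # strong team at the same 9 RPG
```

The second call returns a slightly larger RPW than the first, which is the team-quality effect that an RPG-only linear formula cannot see.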

Having
calculated the actual Pythagenpat RPW for all teams, we can run a linear
regression with RPG as the independent variable to get an alternative formula,
which winds up being:

RPW = .7818*RPG + 2.6823

This is reasonably close to my formula (and thus an argument in favor of “centering” being a reasonable approach), but it takes the slope higher and the intercept lower. In other words, it moves away from Tango rather than closing the gap as we might have hoped or expected. This formula has a RMSE of 4.0364, still worse than Tango’s although better than mine.
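The regression step looks roughly like this in Python. It is a sketch only: the grid of per-game (R, RA) pairs is a hypothetical stand-in for the real team data, so the fitted coefficients will not match .7818 and 2.6823, and the helper names `pythagenpat_rpw` and `fit_line` are made up for illustration.

```python
def pythagenpat_rpw(r, ra, z=0.282):
    """Pythagenpat-implied RPW for per-game rates r and ra."""
    rpg = r + ra
    x = rpg ** z
    if r == ra:
        return 2 * rpg / x              # limiting value for a .500 team
    return (r - ra) / (r ** x / (r ** x + ra ** x) - 0.5)


def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical grid of per-game (R, RA) pairs standing in for real teams.
rates = [3.5, 4.0, 4.5, 5.0, 5.5]
teams = [(r, ra) for r in rates for ra in rates]
rpg = [r + ra for r, ra in teams]
rpw = [pythagenpat_rpw(r, ra) for r, ra in teams]
slope, intercept = fit_line(rpg, rpw)
print(f"RPW = {slope:.4f}*RPG + {intercept:.4f}")
```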

At this point,
the logical question is how far can we push the slope down and the intercept up
to minimize RMSE? According to Excel solver, quite far:

RPW = .6528*RPG + 3.8760

This is a huge
difference even from Tango’s formula, with the slope 13% lower and the
intercept 29% higher. RMSE = 4.0334, ever so slightly lower than even
Pythagenpat.
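The same kind of search can be approximated outside of Excel. The sketch below swaps in a plain grid search over slope and intercept, which is not what Excel's solver does internally, and runs it on hypothetical team lines rather than the 1961-2019 sample. The names `rmse_per_162` and `grid_search` are assumptions for illustration.

```python
from math import sqrt


def rmse_per_162(teams, slope, intercept):
    """RMSE of W% = .5 + RD/(slope*RPG + intercept), scaled to 162 games."""
    total = 0.0
    for r, ra, w_pct in teams:
        est = 0.5 + (r - ra) / (slope * (r + ra) + intercept)
        total += (162 * (est - w_pct)) ** 2
    return sqrt(total / len(teams))


def grid_search(teams, slopes, intercepts):
    """Return (rmse, slope, intercept) with the lowest RMSE on the grid."""
    return min((rmse_per_162(teams, m, b), m, b)
               for m in slopes for b in intercepts)


# Hypothetical team lines (R/G, RA/G, W%) for illustration, not real data.
teams = [(5.2, 4.1, 0.611), (4.0, 4.9, 0.420), (4.6, 4.4, 0.512),
         (3.9, 3.6, 0.540), (5.5, 5.6, 0.481)]
slopes = [0.50 + 0.01 * i for i in range(51)]       # 0.50 through 1.00
intercepts = [2.0 + 0.05 * i for i in range(61)]    # 2.00 through 5.00
best_rmse, best_slope, best_intercept = grid_search(teams, slopes, intercepts)
print(best_rmse, best_slope, best_intercept)
```

With only a handful of teams the minimum is very shallow, so many slope and intercept pairs score almost identically. That same flatness is visible in the real results above, where every formula's RMSE sits between 4.0334 and 4.0370.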

Why can we
improve the accuracy of our W% estimate (at least working with this sample of
the last sixty years of MLB), even while getting farther away from the RPW
relationship suggested by Pythagenpat? Unfortunately, I don’t have a satisfying
answer to that question. It’s tempting to say that we are losing something by
eliminating the team’s quality (e.g. the difference and/or ratio between their
runs and runs allowed), which Pythagenpat considers in addition to the level of
run scoring (RPG). Of course, the best-fit cares not about quality either, and
I don’t have a compelling explanation for why lowering the slope and raising
the intercept would be related to that.

Big baseball analytics fan, similarly-oriented math guy, occasional reader of your blog...

Another way to view it: if I say RPW = 2*R^(1 - alpha), what values do alpha and the reference R need to take on in order to exactly match Tango's RPW = 3/4 * R + 3?

General slope-intercept formula is

RPW_lin = 2/Rref^alpha * [alpha*Rref + (1-alpha)*R]

and the values that match Tango's are alpha = 0.279877 and Rref = 10.292001.
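Those two values can be recovered numerically. This is a sketch under the assumption that the slope-intercept formula above is the tangent line to 2*R^(1 - alpha) at Rref, so the slope is 2*(1 - alpha)/Rref^alpha and the intercept is 2*alpha*Rref^(1 - alpha). Dividing intercept by slope gives Rref = (intercept/slope)*(1 - alpha)/alpha, which leaves one equation in alpha to solve by bisection. The function name `solve_alpha_rref` is hypothetical.

```python
def solve_alpha_rref(slope, intercept, lo=0.05, hi=0.8, iters=100):
    """Find alpha, Rref so the tangent to 2*R**(1 - alpha) at Rref
    equals slope*R + intercept."""
    ratio = intercept / slope                   # equals alpha*Rref / (1 - alpha)

    def slope_error(alpha):
        rref = ratio * (1 - alpha) / alpha
        return 2 * (1 - alpha) * rref ** (-alpha) - slope

    for _ in range(iters):                      # plain bisection on alpha
        mid = (lo + hi) / 2
        if slope_error(lo) * slope_error(mid) <= 0:
            hi = mid
        else:
            lo = mid
    alpha = (lo + hi) / 2
    return alpha, ratio * (1 - alpha) / alpha


alpha, rref = solve_alpha_rref(0.75, 3.0)
print(alpha, rref)
```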