Wednesday, February 17, 2021

Akousmatikoi Win Estimators, pt. 5: Notes on Linear RPW Estimators

I had intended the last installment to be the end of this series, but Tom Tango left a comment on pt. 3 that led me down a rabbit hole. It’s of the frustrating variety, as I can’t figure out how to dig back to the surface and exploring it hasn’t led me to learn anything useful or interesting about baseball. Nevertheless, I find it interesting as a purely mathematical exercise and worth a brief post.

Tango pointed out that he had proposed some time ago the simple formula:

RPW = .75*RPG + 3 (Tango’s version was originally expressed as 1.5*RPG + 3 because he was defining RPG as the average for one team; I’ll keep with my definition here for consistency with the rest of the series)

I was aware of this formula and have mentioned it on this blog before, but it slipped my mind when writing these posts. You may recall from pt. 3 that I offered the formula:

RPW = .777*RPG + 2.694

A brief reminder of how this was derived – I started by differentiating the Pythagenpat formula for a fixed z value of .282 with respect to run differential, and then plugging in the appropriate values for a .500 team to get RPW = 2*RPG^(1 – .282). Then I differentiated this formula with respect to RPG, and found the y = mx + b formula that would follow if you assumed a .500 team with the average of RPG and RPW of the 1961 – 2019 major leagues.
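As a rough check on that derivation, here is a minimal Python sketch of the tangent-line step. The function name `tangent_line` is mine, and the reference level of 8.83 RPG is an assumption backed out from the .777 and 2.694 coefficients rather than a figure quoted from the data, so treat the result as approximate.

```python
def tangent_line(rpg_ref, z=0.282):
    """Slope and intercept of the line tangent to RPW = 2*RPG^(1 - z) at rpg_ref."""
    slope = 2 * (1 - z) * rpg_ref ** (-z)   # d(RPW)/d(RPG) evaluated at rpg_ref
    rpw_ref = 2 * rpg_ref ** (1 - z)        # RPW for a .500 team at rpg_ref
    intercept = rpw_ref - slope * rpg_ref   # line passes through (rpg_ref, rpw_ref)
    return slope, intercept


# ASSUMPTION: 8.83 RPG is implied by the published coefficients, not taken from the data.
slope, intercept = tangent_line(8.83)
print(slope, intercept)
```

With that assumed reference point, the slope and intercept should land close to the .777 and 2.694 quoted above.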

Of course, these formulas both take the form y = mx + b, where y is the estimated RPW and x is the team’s RPG. My formula has a higher slope, but a lower intercept. At 9 RPG, for a team with a run differential of one per game, mine would estimate 97.72 wins and Tango’s 97.62. This doesn’t seem like a lot, and in the grand scheme of things it isn’t, but if this kind of difference didn’t interest me then this blog wouldn’t exist.
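The arithmetic behind that comparison can be sketched in a few lines of Python. The function names are mine; the only inputs are the two sets of coefficients, 9 RPG, and a run differential of one per game over a 162-game season.

```python
def linear_rpw(rpg, slope, intercept):
    """Runs per win from a linear estimator: RPW = slope*RPG + intercept."""
    return slope * rpg + intercept


def estimated_wins(rpg, run_diff_per_game, slope, intercept, games=162):
    """Wins implied by W% = .500 + (run differential per game) / RPW."""
    w_pct = 0.5 + run_diff_per_game / linear_rpw(rpg, slope, intercept)
    return w_pct * games


mine = estimated_wins(9.0, 1.0, 0.777, 2.694)
tango = estimated_wins(9.0, 1.0, 0.75, 3.0)
print(round(mine, 2), round(tango, 2))  # 97.72 97.62
```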

Using the 1961-2019 data, and scaling the RMSE to 162 games, Tango’s formula has an RMSE of 4.0348, and mine 4.0370. Pythagenpat itself (z = .282) checks in at 4.0345, which is interesting – my RPW formula performs worse than Tango’s, but is derived directly from Pythagenpat, which performs better. Also interesting is that with real major league teams, Tango’s formula is about as accurate as you can get despite being very simple (relative to full-blown Pythagenpat) and having rounded coefficients.
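For anyone who wants to run this kind of comparison on their own data, here is a minimal Python sketch. It assumes "scaling to 162 games" means multiplying the RMSE of W% by 162, and the three teams in the example are made-up placeholders, not rows from the 1961-2019 sample.

```python
import math


def rmse_per_162(teams, slope, intercept):
    """RMSE of a linear-RPW win estimate, scaled to a 162-game season.

    teams: iterable of (actual W%, runs per game, runs allowed per game).
    ASSUMPTION: scaling means 162 times the RMSE measured in W%.
    """
    squared_errors = []
    for w_pct, r, ra in teams:
        rpw = slope * (r + ra) + intercept   # RPG here is both teams' runs combined
        estimate = 0.5 + (r - ra) / rpw
        squared_errors.append((w_pct - estimate) ** 2)
    return 162 * math.sqrt(sum(squared_errors) / len(squared_errors))


# HYPOTHETICAL teams for illustration only.
sample = [(0.600, 5.0, 4.0), (0.450, 4.2, 4.6), (0.500, 4.5, 4.5)]
print(rmse_per_162(sample, 0.75, 3.0))     # Tango's coefficients
print(rmse_per_162(sample, 0.777, 2.694))  # coefficients from pt. 3
```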

Note, I’m emphasizing RMSE with real teams in this discussion because if you want theoretical accuracy over a wide range of possible team R/RA combinations, you’d just use Pythagenpat and be done with it. If you’re using a simplification that isn’t as accurate as an equally simple formula for the application you’ll most use it for, what’s the point?

My first thought as to why Tango’s formula had a lower RMSE than mine was that I had over-flattened the whole thing and was thus missing something. This series starts from the premise that Pythagenpat is the right model for win estimation, and then simplifies from there, often centering at the point of a team that scores and allows the same number of runs, in an average scoring context. But the teams in the sample data, while by definition centered there, vary in both axes (R/RA and RPG). Perhaps the linear approximation to the Pythagorean RPW for a .500 team misses some subtle change in the slope or intercept caused by this variation, and you could do better by running a regression on all the individual datapoints rather than using the single point estimate to derive the formula.

So I calculated the actual Pythagenpat RPW for all teams (i.e. the value for RPW which when applied will estimate that the team’s W% will be equal to its Pythagenpat W%), which from pt. 3 is:

RPW = (R – RA)/(R^x/(R^x + RA^x) - .5)

Where x is the Pythagenpat exponent corresponding to each team’s RPG (x = RPG^.282).

This is undefined when R = RA, but also from pt. 3, we can fill this gap with the calculus-derived formula for a team with R = RA:

RPW = 2*RPG/x
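Those two cases can be combined into one function. This is a minimal Python sketch, assuming R and RA are expressed per game and z = .282; the function name is mine.

```python
def pythagenpat_rpw(r, ra, z=0.282):
    """RPW that makes .500 + (R - RA)/RPW reproduce the Pythagenpat W%.

    r, ra: runs scored and allowed per game (assumed units).
    """
    rpg = r + ra
    x = rpg ** z                      # Pythagenpat exponent for this team
    if r == ra:
        return 2 * rpg / x            # calculus-derived value for a .500 team
    w_pct = r ** x / (r ** x + ra ** x)
    return (r - ra) / (w_pct - 0.5)


print(pythagenpat_rpw(5.0, 4.0))
print(pythagenpat_rpw(4.5, 4.5))
```

The exact-equality test is enough for a sketch; with real data you may prefer to switch to the R = RA formula whenever the two values are very close, since the general formula divides one small number by another there.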

Having calculated the actual Pythagenpat RPW for all teams, we can run a linear regression with RPG as the independent variable to get an alternative formula, which winds up being:

RPW = .7818*RPG + 2.6823
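The regression step itself is ordinary least squares with one predictor. Here is a minimal Python sketch; `fit_line` is my own helper, and the inputs are stand-ins (the .500-team RPW on an evenly spaced grid of RPG values) rather than the actual team data, so the coefficients it prints will not match the ones above.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# STAND-IN inputs: RPW = 2*RPG^(1 - .282) on a grid, not the 1961-2019 teams.
rpgs = [7.0, 8.0, 9.0, 10.0, 11.0]
rpws = [2 * g ** (1 - 0.282) for g in rpgs]
slope, intercept = fit_line(rpgs, rpws)
print(slope, intercept)
```

To reproduce the regression described here, replace `rpgs` and `rpws` with each team's RPG and its actual Pythagenpat RPW.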

Which is reasonably close to my formula (and thus an argument in favor of “centering” being a reasonable approach), but takes the slope higher and the intercept lower – in other words, moving away from Tango rather than closing the gap as we might have hoped/expected. This formula has an RMSE of 4.0364, still worse than Tango’s although better than mine.

At this point, the logical question is how far can we push the slope down and the intercept up to minimize RMSE? According to Excel solver, quite far:

RPW = .6528*RPG + 3.8760

This is a huge difference even from Tango’s formula, with the slope 13% lower and the intercept 29% higher. RMSE = 4.0334, ever so slightly lower than even Pythagenpat.

Why can we improve the accuracy of our W% estimate (at least working with this sample of the last sixty years of MLB), even while getting farther away from the RPW relationship suggested by Pythagenpat? Unfortunately, I don’t have a satisfying answer to that question. It’s tempting to say that we are losing something by eliminating the team’s quality (i.e. the difference and/or ratio between their runs and runs allowed), which Pythagenpat considers in addition to the level of run scoring (RPG). Of course, the best-fit line cares not about quality either, and I don’t have a compelling explanation for why lowering the slope and raising the intercept would be related to that.

1 comment:

  1. Big baseball analytics fan, similarly-oriented math guy, occasional reader of your blog...

    Another way to view... If I say RPW = 2*R^(1 - alpha), what values do alpha and reference R need to take on in order to exactly match Tango's RPW = 3/4 * R + 3?

    General slope-intercept formula is

    RPW_lin = 2/Rref^alpha * [alpha*Rref + (1-alpha)*R]

    and the values that match tango's are alpha = 0.279877 and Rref = 10.292001
