Monday, January 19, 2009

Runs Per Win from Pythagenpat

I have written about this topic many times before, but you’ll have to bear with me as it is one of my favorites and I like to reexamine it from time to time.

The Pythagenpat method (of which, in the interests of full disclosure, I am a co-developer along with David Smyth) is, at this time, just about the most accurate single formula that seeks to quantify the relationship between runs and wins. The “single formula” qualifier is included to allow for the fact that other approaches may be more accurate, like the Tango Distribution or any other distributions that attempt to model runs per inning or game. However, using the Tango Distribution to describe the runs-wins relationship involves finding a runs/inning distribution, and then converting this into a runs/game distribution, and...suffice it to say, it cannot be reduced to a simple two or three lines of formulas that you can easily plug into a spreadsheet. That’s not a knock on more advanced approaches, just a reality that leads people to use Pythagenpat and other simple formulas.

Digression aside, Pythagenpat is a dynamic winning percentage estimator. Oftentimes, though, runs-wins converters are applied in 100% linear approaches (one common implementation uses linear weights to measure a batter’s contributions, then converts the result to Batting Wins), and in such a case you want a simple converter that can turn runs into wins with knowledge of only run differential. Pythagenpat requires you to know runs and runs allowed, while a fixed exponent Pythagorean formula requires you to know the run ratio. If you are converting runs above a baseline to wins, you need a formula that works on the differential, since that is precisely what the comparison to a baseline is.

So sabermetricians have developed a number of formulas that give a generalized RPW value for an average team. The most common is a static 10 runs per win, but there are also many approaches that allow RPW to vary with the total number of runs scored. To serve as an example, one of the most common is Pete Palmer’s formula:

RPW = 10*sqrt(runs per inning), where runs per inning is the total runs for both teams
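As a quick illustration, Palmer’s formula can be sketched in Python. The function name palmer_rpw and the nine-innings-per-game conversion from runs per game to runs per inning are assumptions made for this example, not part of the formula itself.

```python
import math

def palmer_rpw(rpg, innings_per_game=9.0):
    """Palmer-style runs per win from total runs per game (both teams).

    Assumption: runs per inning is approximated as rpg / innings_per_game.
    """
    runs_per_inning = rpg / innings_per_game
    return 10 * math.sqrt(runs_per_inning)

# 9 RPG is one run per inning, so RPW comes out to 10
print(palmer_rpw(9))
```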

There are a number of other such formulas out there, and they all do their jobs well enough. However, if you grant for the sake of argument that Pythagenpat is the “best” (understanding its limitations and the qualifier about being a relatively simple formula) W% estimator, then you may be interested in how one can use Pythagenpat to derive such a formula.

First, there is a big assumption that needs to be made. In order to find the RPW value for an average team based at some particular RPG, we need to hold RPG constant. For example, if the RPG is nine, then a run differential (RD) of .1 run/game means that the team scores 4.55 runs and allows 4.45 runs. A RD of 1 run/game would be achieved with 5 runs scored and 4 runs allowed, and so forth.
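The constant-RPG assumption pins down runs scored and allowed once you pick a differential. A minimal Python sketch (split_runs is a hypothetical helper name used only for this illustration):

```python
def split_runs(rpg, rd):
    """Return (runs scored, runs allowed) per game for a fixed RPG and a given RD.

    Solves R + RA = rpg and R - RA = rd.
    """
    r = (rpg + rd) / 2
    ra = (rpg - rd) / 2
    return r, ra

print(split_runs(9, 1))    # (5.0, 4.0)
print(split_runs(9, 0.1))  # approximately 4.55 scored and 4.45 allowed
```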

With that assumption, the Pythagenpat exponent will be a constant, x, which is figured as RPG^z. You’ll see values between .27 and .29 used for z, and it is probably true that .28 is a better choice. However, whether you use .27, .29, or something in between will make essentially no difference with respect to the end game of this post.

Standard Pythagenpat estimates W% as:

EW% = R^x/(R^x + RA^x) = RR^x/(1 + RR^x), where RR = run ratio (R/RA)
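For anyone who prefers code to formulas, here is a minimal Python sketch of the estimator. The function name pythagenpat_wpct and the default z = .29 are choices made for this example; the inputs are assumed to be per-game runs scored and allowed.

```python
def pythagenpat_wpct(r, ra, z=0.29):
    """Pythagenpat expected W% from per-game runs scored (r) and allowed (ra)."""
    rpg = r + ra            # total runs per game for both teams
    x = rpg ** z            # the Pythagenpat exponent
    return r ** x / (r ** x + ra ** x)

# A team scoring 5 and allowing 4 per game comes out at roughly .604
print(round(pythagenpat_wpct(5, 4), 3))
```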

I may use a lot of calculus in my writing, relative to the average sabermetrician, but I’m not by any means a whiz at it. So the formula for the derivative of EW% with respect to RD (I am using RD to represent run differential per game; (R - RA)/G) that I’m about to print may very well be needlessly complex and easily simplified. Nonetheless:

dEW%/dRD = x*RR^(x-1)/(2*RPG*(RR^x + 1)^2*(.5 - RD/(2*RPG))^2)

This is actually in the form of wins/run; the reciprocal is runs per win, and is:

RPW = (2*RPG*(RR^x + 1)^2*(.5 - RD/(2*RPG))^2)/(x*RR^(x-1))

We are interested in a generalized formula for RPW that does not depend on the team’s ratio or differential between runs scored and allowed, just the RPG. Therefore, what we’re after is the RPW for an average team at a given RPG. Since the team is average (R = RA), we know that it has a RD of 0 and a RR of 1. Plugging that in:

RPW = (2*RPG*(1^x + 1)^2*(.5 - 0/(2*RPG))^2)/(x*1^(x - 1))
= (2*RPG*(2)^2*(.5)^2)/x
= (2*RPG)/x

For a standard Pythagorean equation with x = 2, this reduces to RPW = RPG.
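If you would like to check the calculus without redoing it by hand, a finite-difference comparison is one way to go about it. The Python sketch below is only an illustration: the function names are hypothetical, RPG and the exponent x are held fixed as assumed above, and the numerical slope is a central difference rather than an exact derivative.

```python
def wpct_fixed_rpg(rd, rpg, x):
    """Pythagorean W% for run differential rd, holding RPG and exponent x fixed."""
    r = (rpg + rd) / 2
    ra = (rpg - rd) / 2
    return r ** x / (r ** x + ra ** x)

def rpw_analytic(rd, rpg, x):
    """RPW from the closed-form expression in the text (reciprocal of dEW%/dRD)."""
    rr = (rpg + rd) / (rpg - rd)
    numerator = 2 * rpg * (rr ** x + 1) ** 2 * (0.5 - rd / (2 * rpg)) ** 2
    return numerator / (x * rr ** (x - 1))

def rpw_numeric(rd, rpg, x, h=1e-6):
    """RPW from a central finite difference of W% with respect to RD."""
    slope = (wpct_fixed_rpg(rd + h, rpg, x) - wpct_fixed_rpg(rd - h, rpg, x)) / (2 * h)
    return 1 / slope

# Average team at 9 RPG with x = 2: both approaches should give RPW = RPG = 9
print(rpw_analytic(0, 9, 2), rpw_numeric(0, 9, 2))
```

The two functions should also agree away from RD = 0, which is a reasonable sanity check on the longer expression.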

In the case of Pythagenpat, we have set x = RPG^z, and so we can simplify further:

RPW = (2*RPG)/(RPG^z) = 2*RPG^(1 - z)

So for z = .29, the generalized RPW, derived directly from Pythagenpat, is 2*RPG^.71, and in order to estimate W% using this equation, you just use the general formula for all RPW estimators, W% = RD/RPW + .5.
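Put into Python, the differential estimator might look like the sketch below. The function names are hypothetical and z = .29 is used as the default only because it is the value in the example above.

```python
def rpw_pythagenpat(rpg, z=0.29):
    """Generalized runs per win for an average team at a given RPG."""
    return 2 * rpg ** (1 - z)

def wpct_from_differential(rd, rpg, z=0.29):
    """W% from per-game run differential via the general form W% = RD/RPW + .5."""
    return rd / rpw_pythagenpat(rpg, z) + 0.5

# A team that is +1 run per game in a 9 RPG context comes out at roughly .605
print(round(wpct_from_differential(1, 9), 3))
```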

None of this is new; all of the above has been previously published in some form or another either by Ralph Caola (the general findings, in his articles in By the Numbers--see Nov/2003, Feb/2004, and May/2004) or by myself (the Pythagenpat application).

Suppose, however, that you’d like to further simplify the relationship between RPW and RPG. You don’t want to have to deal with any exponents and you’re not concerned about whether it works for extreme theoretical situations. You just want a straightforward formula that allows RPW to vary with RPG as you know it should, will be easy to calculate, and will work for normal major league teams.

There are a number of ways you could try to approximate the function above, but one of the easiest is to take the tangent line of the function at a particular point. Since a RPG of 9 is easy to remember and very close to the long-term MLB average, we’ll use that point to find our tangent line.

I’ll write the line in point-slope form, y - y1 = m(x - x1), where y will be RPW, y1 will be RPW at the specific point (RPG = 9), m is the slope of the RPW function at that point, x is RPG (not the Pythagenpat exponent from earlier), and x1 is 9, the RPG at the point of tangency.

The derivative of 2*RPG^(1 - z) with respect to RPG is (1-z)*2*RPG^(-z) = 2*(1-z)/RPG^z. For z = .29, it is 2*.71/RPG^.29 = 1.42/RPG^.29, which evaluates to .7509 at RPG = 9.

The RPW for a RPG of 9 is 2*9^.71 = 9.5179, and so we can put it all together and get this formula:

RPW - 9.5179 = .7509*(RPG - 9)

Simplifying this and solving for RPW gives:

RPW = .7509*RPG + 2.7598

And since we’re going for simplicity here, why not make sure all the coefficients are multiples of .05?

RPW = .75*RPG + 2.75

Comparing this approximation to 2*RPG^.71, the two are in agreement to within .05 RPW for RPGs between 7 and 11.5. It is within .20 RPW for 5.5-13.5 RPG. Beyond that range, there is a lot of divergence. For example, at the known point RPW = 2 when RPG = 1, the linear approximation gives 3.5. Fortunately, though, 5.5-13.5 RPG encompasses the scoring range that is normally seen from major league teams, and the approximation is fine within those bounds.
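The comparison is easy to reproduce. Here is a short Python sketch (the function names and the particular RPG values sampled are arbitrary choices for illustration) that prints the gap between the two formulas:

```python
def rpw_exact(rpg):
    """Pythagenpat-derived RPW with z = .29."""
    return 2 * rpg ** 0.71

def rpw_linear(rpg):
    """Rounded tangent-line approximation taken at RPG = 9."""
    return 0.75 * rpg + 2.75

for rpg in (1, 5.5, 7, 9, 11.5, 13.5):
    gap = rpw_linear(rpg) - rpw_exact(rpg)
    print(f"RPG {rpg:>4}: exact {rpw_exact(rpg):6.3f}  linear {rpw_linear(rpg):6.3f}  gap {gap:+.3f}")
```

The gap is slightly negative right around 9 RPG and positive toward either end of the range, which is what you would expect from a rounded tangent line to a concave curve.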

So there you have it: a 100% linear winning percentage estimator derived from Pythagenpat (given the assumptions that I’ve made). As I mentioned before, there are a bunch of RPW estimators out there, so it wouldn’t be surprising if this one or something close to it has been published previously. And indeed, that is the case.

Tango Tiger uses the formula 1.5*(RPG + 2) to estimate RPW, except his formula defines RPG as the runs for one team whereas I am using it to represent runs for both teams. So in my terms, his formula is 1.5*(RPG/2 + 2), which simplifies to .75*RPG + 3.
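The unit conversion is easy to get backwards, so here is a small Python sketch of the two equivalent forms. The function names are hypothetical; the only thing being illustrated is the switch from one-team to two-team RPG.

```python
def tango_rpw_one_team(rpg_one_team):
    """Tango's formula with RPG defined as runs per game for one team."""
    return 1.5 * (rpg_one_team + 2)

def tango_rpw_both_teams(rpg):
    """The same formula rewritten for RPG defined as runs for both teams."""
    return 0.75 * rpg + 3

# 4.5 runs per team is 9 RPG in this post's terms; both forms give 9.75
print(tango_rpw_one_team(4.5), tango_rpw_both_teams(9))
```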

Now you can see why it works--it is a consequence (*) of using Pythagenpat to derive a 100% linear estimator at the normal (9 RPG) major league scoring level, and since Pythagenpat is the “best” W% estimator, any formula you derive from it should be similar to what other approaches like regression would produce.

After I wrote this piece, this topic came up at Fangraphs as they use Tango’s formula. So I checked the 1961-2003 data (excluding 1981 and 1994) and found that the +3 intercept had a slightly lower RMSE in predicting W% (3.949 to 3.951 per 162 games). I was surprised by this, since the teams in the sample had a mean RPG of 8.74, and the tangent line I took was at RPG = 9. I don’t have an explanation for why this is, but I’ll pass it along anyway.

In order for the tangent line approach to approximating Pythagenpat RPW to yield an intercept of 3, RPG must be about 10.1 (with the slope around .73) with a z value of .29, since the intercept of the tangent line works out to 2*z*RPG^(1 - z). With z = .28, you would need a RPG of about 10.3 to get an intercept of 3. This is related to the phenomenon broached in the last paragraph, and I can’t explain it, although I’m not sure it’s something to be concerned about.

Allow me to finish on a digression. Among the W% estimators that can be written as relatively simple formulas, there are two main types: differential estimators and ratio estimators. Of course, the distinction I’m drawing is that the input into differential estimators is run differential and the input into ratio estimators is run ratio.

Within each of those classes, you can break it down further into what I’ll call “dumb” and “smart” methods. Dumb methods use a fixed RPW or a fixed exponent; they assume that the relationship between runs and wins is the same regardless of how high scoring goes. 10 runs = 1 win is a dumb differential estimator; Pythagorean with a fixed exponent like 2 is a dumb ratio estimator.

Smart estimators, of course, change the price of a win as the scoring level changes. Palmer’s formula or Tango’s formula exemplify smart differential estimators, while Pythagenport or Pythagenpat are smart ratio estimators.

I’m not really going anywhere with this except to say that I think it is pretty clear that the smart ratio estimators work better, theoretically, than the smart differential estimators. (As an aside, a smart differential estimator can definitely be more accurate with normal teams than a dumb ratio estimator. The dumb ratio will win some under extreme conditions since it is bounded by zero and one, but a smart differential estimator can beat it when applied to normal ranges.) So, by using a differential method, you are already sacrificing some theoretical accuracy in favor of expediency. So why not simplify things further, and use a 100% linear approach?

RPW = 2*RPG^.71 is linear in a sense, since it is a differential estimator. It values each additional run equally; you fix the RPG to begin with, and as your differential changes (but the total stays the same), each run you gain is worth 1/(2*RPG^.71) wins.

It is not, of course, a purely linear function, since it has an exponent. And my point is, why bother? It’s nice to have that formula around to answer specific questions, but if you are ever going to apply it generally, you should either 1) just go ahead and use Pythagenpat or 2) use something simpler. And that is why .75*RPG + 2.75 or 3 is a nice little formula to have around. That it can be semi-derived from Pythagenpat? All the better.

Technical addendum: If you want a general formula for the tangent line so that you can try it with different values of z and RPG, here it is:

pRPW (pythpat RPW) = 2*RPG^(1 - z)
m = 2*(1-z)/RPG^z
b = pRPW - m*RPG
lRPW (linear RPW) = m*RPG + b
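
The same four lines translate directly into Python. This is just a sketch of the addendum’s formulas; the function names tangent_line and linear_rpw are made up for the example.

```python
def tangent_line(rpg, z):
    """Slope m and intercept b of the tangent to 2*RPG^(1 - z) at the given RPG."""
    p_rpw = 2 * rpg ** (1 - z)      # pRPW at the point of tangency
    m = 2 * (1 - z) / rpg ** z      # slope of pRPW at that point
    b = p_rpw - m * rpg             # intercept of the tangent line
    return m, b

def linear_rpw(rpg, m, b):
    """lRPW: the linear estimate m*RPG + b."""
    return m * rpg + b

# The tangent at RPG = 9 with z = .29 has a slope of roughly .75 and an intercept of roughly 2.76
m, b = tangent_line(9, 0.29)
print(round(m, 2), round(b, 2))
```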

(*) Of course, that’s a convoluted way of looking at it--all of these equations are the result of attempting to model the reality of baseball, and are a consequence of that, not each other. However, if you start from the premise that Pythagenpat is the best model (which can certainly be debated), and proceed from there to find a linear estimator of RPW, .75*RPG + b is where you end up.

6 comments:

  1. I apologize in advance for being dense, but:

    1. I understand that .75 * RPG + 2.75 is the simplified linear model, but

    2. If I want a slightly 'more advanced' model, then the answer would be to use 2 * RPG ^ .71? I've actually been using 2 * RPG ^ .72, so that is somewhat comforting.

    3. I'm confused about what formula would be the 'next' level of more accuracy/less simplicity after 2*RPG^.71. Could you give that formula and walk thru an example for RPG = 2?

  2. No problem.

    2. Yes, 2*RPG^z is the more advanced model. There's nothing wrong with using .72, so I wouldn't go changing any formulas if I was you.

    3. There isn't a next level, in terms of a runs per win formula. The next level in sophistication would be to switch from a "smart" differential model like 2*RPG^.72 to a "smart" ratio model like Pythagenpat. And for a large number of questions, that's not practical and not helpful.

  3. Do you know the RMSE of Palmer's way of converting runs to wins?

  4. For the sample used in the post, it is 3.974, with the caveat that I am using RPG/9 rather than (R+RA)/inning as the formula calls for.

  5. So I can use the equation RPW = 2*RPG^.71 for any run environment and any era?

    You say: "You'll see values between .27-.29 (.71 to .73 for RPW) used for z, and it is probably true that .28 is a better choice."

    Why is .28 a better choice? If it is a better choice, should I use .72 for z?

  6. "So I can use the equation RPW = 2*RPG^.71 for any run environment and any era?"

    To the extent you'd be comfortable using Pythagenpat for that environment, yes.

    "Why is .28 a better choice?"

    Tango ran some tests (years ago, when we first published Pythagenpat) and found that it was a middle-ground between matching the theoretical results from the Tango Distribution and the real-world data, IIRC. But .28, .29, whatever, is not a huge deal.

    "If it is a better choice, should I use .72 for z?"

    Yes. It'll be one minus whatever exponent you use.
