Monday, November 19, 2007

Tangent Lines and Bill Kross

This is a math post with little baseball content and no baseball insight, so be forewarned.

In calculus, at least as far as I understand it, the tangent line at a point on a curve is the line that passes through that point and has the same slope as the curve has there. That’s the best I can do--see this Wikipedia article for a better description.

Anyway, the tangent line is linear (it can be written as y = mx + b), and it shares the same slope as the curve at the point where it touches it. That means that near the point in question, it is just about the best linear approximation that you can get.

Where this ties into baseball is that if we have a non-linear function and want a linear approximation to it, the tangent line can be a shortcut that is easier and quicker than generating a line through some other technique (such as regression). Understanding how the tangent line works can also help us understand why non-linear baseball models have the linear approximations that they do.

First, let’s calculate a tangent line for a non-baseball problem. Suppose we have the curve z = x^3, and we want a tangent line at the point x = 3. At x = 3, z = 3^3 = 27. The slope at x = 3 can be found by first taking the derivative of z, which is z’ = 3x^2, so z’(3) = 3(3)^2 = 27.

We can write the line in the point-slope format as y - y1 = m(x - x1), where y1 and x1 are the base (x,y) point and m is the slope. So y - 27 = 27(x - 3). We can convert this to the common y = mx + b form to get y = 27x - 54.

At x = 3, y = 27(3) - 54 = 27, which is exactly equal to z, as we know it should be. If we look at another x value close to 3, say 3.1, we get z = 29.791. We get y = 29.7. As you can see, they are pretty close. As we get further away, the linear approximation will perform worse, especially for functions whose slope changes quickly.
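
The worked example above can be checked with a few lines of Python. This is only a sketch: the function names (cube, cube_slope, tangent_line) are my own labels, not anything standard, and the extra x values beyond 3.1 are chosen just to show the error growing.

```python
def cube(x):
    """The curve z = x^3."""
    return x ** 3


def cube_slope(x):
    """Derivative of x^3, i.e. z' = 3x^2."""
    return 3 * x ** 2


def tangent_line(f, slope, x0):
    """Return (m, b) so that y = m*x + b is tangent to f at x0."""
    m = slope(x0)
    b = f(x0) - m * x0  # from y - y1 = m(x - x1), solved for the intercept
    return m, b


m, b = tangent_line(cube, cube_slope, 3)
print(f"tangent line: y = {m}x + ({b})")

# Compare the curve and the tangent line as we move away from x = 3.
for x in (3.0, 3.1, 3.5, 4.0):
    z = cube(x)
    y = m * x + b
    print(f"x = {x:.1f}  z = {z:.3f}  y = {y:.3f}  error = {z - y:.3f}")
```

The error column is zero at x = 3 and grows as x moves away from the point of tangency.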

Now, let’s talk about some of the baseball relationships where this is applicable. Clay Davenport used to publish a team version of EQR in which (RAW/LgRAW)^2 approximated the percentage to which the team R/PA exceeded the league average. There is also a linear version (which is the only one I have seen Clay publish in some time), in which the mapping is 2*(RAW/LgRAW) - 1.

Let’s call RAW/LgRAW “ARAW” for adjusted RAW. The two relationships we have are ARAW^2 and 2*ARAW - 1. Now suppose we work with the squared function and find the tangent line at the league average point, where ARAW = 1 and the result of the formula = 1 (this is common sense, as a team with a RAW equal to the league average should score runs at a rate equal to the league average). The slope of ARAW^2 is 2*ARAW, which is 2*1 = 2 when ARAW = 1. So y - 1 = 2*(ARAW - 1), and y = 2*ARAW - 1. As you can see, that is the other Davenport formula.
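
To see how closely the two forms agree near the league average, here is a small Python sketch. The function names and the ARAW values are made up for illustration; they are not Davenport’s code or real team figures.

```python
def squared_form(araw):
    """Squared version: ARAW^2."""
    return araw ** 2


def linear_form(araw):
    """Tangent line to ARAW^2 at ARAW = 1: 2*ARAW - 1."""
    return 2 * araw - 1


# Illustrative ARAW values clustered around the league average of 1.
for araw in (0.90, 0.95, 1.00, 1.05, 1.10):
    sq = squared_form(araw)
    lin = linear_form(araw)
    print(f"ARAW = {araw:.2f}  squared = {sq:.4f}  linear = {lin:.4f}  gap = {sq - lin:.4f}")
```

The gap works out to (ARAW - 1)^2, so it is only about .01 even for a team 10% above or below the league.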

This is no surprise, as even if Davenport derived the relationship through a regression approach, we would expect the best fit line to be about the same as the tangent line at the league average, since most of the teams are tightly clustered around that point.
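
As a rough check on that claim, the sketch below fits an ordinary least-squares line to ARAW^2 over an evenly spaced grid of synthetic ARAW values from .90 to 1.10. The grid is an assumption standing in for real teams, and least_squares is a hand-rolled helper, not anyone’s published method.

```python
def least_squares(xs, ys):
    """Simple one-variable least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic ARAW values from 0.90 to 1.10 in steps of 0.01 (not real data).
xs = [0.90 + 0.01 * i for i in range(21)]
ys = [x ** 2 for x in xs]

slope, intercept = least_squares(xs, ys)
print(f"regression line: y = {slope:.4f}*ARAW + ({intercept:.4f})")
print("tangent line:    y = 2*ARAW - 1")
```

On this symmetric grid the fitted slope comes out at essentially 2 and the intercept within a few thousandths of -1, which is the tangent line for all practical purposes.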

Another stat which follows the same relationship to runs is OPS. David Smyth (and perhaps others, but I recall seeing David write it) has pointed out that the square of relative OPS (not OPS+, but straight OPS/LgOPS) tracks runs, and Steve Mann wrote about the similar 2*(OPS/LgOPS) - 1 relationship eighteen years ago in The Baseball Superstats 1989.

The most interesting relationship, though, is the Pythagorean win estimator. I have written about this before on my website. Pyth can be written as:

WR = RR^z

Where WR is the win ratio (W/L), RR is the run ratio (R/RA), and z is the exponent (usually seen as z = 2). We know that for an average team, RR = WR = 1. The slope of the function is z*RR^(z - 1). If z = 2, then it is just 2*RR, which is 2 when RR = 1. If z = 1.83 (another common value), then it would be 1.83*RR^.83, which is 1.83 when RR = 1.

With z = 2, the tangent line at RR = 1 is WR - 1 = 2*(RR - 1), or WR = 2*RR - 1. We know that W% = WR/(WR + 1). We can therefore write this as a W% estimator as W% = (2*RR - 1)/(2*RR).
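
Here is a minimal Python sketch comparing the Pythagorean estimate with the tangent-line estimator, assuming z = 2 for the linear version. The function names and the run totals are hypothetical, chosen only to show the pattern.

```python
def pythagorean_wpct(runs, runs_allowed, z=2.0):
    """Pythagorean estimate: WR = RR^z, then W% = WR/(WR + 1)."""
    rr = runs / runs_allowed
    wr = rr ** z
    return wr / (wr + 1)


def tangent_wpct(runs, runs_allowed):
    """Tangent-line estimate with z = 2: W% = (2*RR - 1)/(2*RR)."""
    rr = runs / runs_allowed
    return (2 * rr - 1) / (2 * rr)


# Hypothetical (R, RA) pairs, not real teams.
for runs, runs_allowed in ((750, 750), (800, 700), (900, 600)):
    pyth = pythagorean_wpct(runs, runs_allowed)
    lin = tangent_wpct(runs, runs_allowed)
    print(f"R = {runs}  RA = {runs_allowed}  Pyth = {pyth:.4f}  linear = {lin:.4f}")
```

The two agree exactly for an average team and drift apart as the run ratio moves away from 1.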

This method of estimating W% was discussed, informally, by Bill James in the 1984 Baseball Abstract. James said that if a team scored 10% more runs than their opponents, they should win 20% more games. He wrote that he had never tried it but it “should work”, and dubbed it “Double the Edge”. I have no idea whether Bill came up with this through similar mathematical logic to what you see here, or whether it was intuitive. With James, I’d believe either.

Anyway, the good thing about this estimator is that it caps W% at 1. However, it does not bottom out at zero--a RR of less than .5 results in a negative W%.

Ralph Caola, who has done a lot of work on run to win converters, emailed me after reading the article on my site and suggested that to solve this problem, one could use two equations: one when Run Ratio is greater than one, and one when Run Ratio is less than one. For the less than case, you could define W% as 1 - (2*OppRR - 1)/(2*OppRR), where OppRR is the opponents’ run ratio, RA/R. This way, reciprocal run ratios would produce complementary W%s, as we would intuitively expect (and as Pythagorean gives).

There are dozens of ways you can write those formulas, and Ralph settled on W% = (R-RA)/(R + RA + ABS(R-RA)) + .5.
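
A short Python sketch of Ralph’s ABS form next to the one-sided version, to show what the fix buys. The function names and the run totals are made up for illustration.

```python
def one_sided_wpct(runs, runs_allowed):
    """Original one-equation form: W% = (2*RR - 1)/(2*RR)."""
    rr = runs / runs_allowed
    return (2 * rr - 1) / (2 * rr)


def caola_wpct(runs, runs_allowed):
    """ABS form quoted above: (R - RA)/(R + RA + ABS(R - RA)) + .5."""
    diff = runs - runs_allowed
    return diff / (runs + runs_allowed + abs(diff)) + 0.5


# Hypothetical (R, RA) pairs; the last one has a run ratio below .5.
for runs, runs_allowed in ((800, 700), (700, 800), (300, 700)):
    print(
        f"R = {runs}  RA = {runs_allowed}  "
        f"one-sided = {one_sided_wpct(runs, runs_allowed):.4f}  "
        f"Caola = {caola_wpct(runs, runs_allowed):.4f}"
    )
```

Swapping R and RA gives complementary W%s under the ABS form, and the estimate stays above zero even when the run ratio drops below .5.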

And sure enough, the equation is more accurate and more theoretically sound if you use Caola’s insight. However, I have recently realized that Ralph was not the first one to uncover this formula. In fact, it has been in the public eye for over twenty years and little has been said about it. (I am not necessarily bemoaning this, because the only reason to use the linear approximations to Pythagorean is simplicity. They are not preferable. However, with the increased presence of sabermetric research all over the place, I am a bit surprised that Ralph and I seem to have been the only ones to play around with James’ Double the Edge).

In The Hidden Game of Baseball, there is a brief description of several run to win methods in Chapter 4. In a footnote, Palmer/Thorn write “About a year after Pete’s article [in SABR’s The National Pastime] appeared, Bill Kross, a Purdue professor, devised an elegant little formula that was not only simpler than the others, but also very nearly as accurate, erring only when run differentials were extreme (+/- 200 runs). If a team is outscored by its opponents, Kross predicts its winning percentage by dividing runs scored by two times runs allowed; if a team outscores its opponents, the formula becomes, 1 - RA/(2*R).”

Remember what I said about there being dozens of different ways to write the DTE formula? I am not going to go through the algebra here, but suffice it to say that the Kross formulas are one of the dozens. I don’t know if Mr. Kross developed those by linearizing the Pythagorean formula, or through some other technique, but there it is. These formulas are not a breakthrough in accuracy, be it empirical or theoretical, but they are quick and easy and do have a strong logical foundation, and can even be seen as offshoots of Pythagorean estimators.
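
In place of the algebra, here is a numerical check in Python that the Kross formulas and the ABS form of Double the Edge give the same answer. The function names and the grid of run totals are my own assumptions for illustration.

```python
def kross_wpct(runs, runs_allowed):
    """Kross: R/(2*RA) if outscored, otherwise 1 - RA/(2*R)."""
    if runs < runs_allowed:
        return runs / (2 * runs_allowed)
    return 1 - runs_allowed / (2 * runs)


def dte_abs_wpct(runs, runs_allowed):
    """ABS form: (R - RA)/(R + RA + ABS(R - RA)) + .5."""
    diff = runs - runs_allowed
    return diff / (runs + runs_allowed + abs(diff)) + 0.5


# Largest disagreement over a grid of hypothetical run totals.
totals = range(300, 1001, 50)
max_gap = max(
    abs(kross_wpct(r, ra) - dte_abs_wpct(r, ra))
    for r in totals
    for ra in totals
)
print(f"largest difference between the two forms: {max_gap:.2e}")
```

The reason is simple: when R > RA the ABS denominator collapses to 2*R, and when R < RA it collapses to 2*RA, which leaves exactly the two Kross expressions.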

4 comments:

  1. A tangent is a line that intersects a curve at one and only one point.

  2. RAW is the rate stat that feeds into the Equivalent Runs formula. See http://baseballprospectus.com/glossary/index.php?mode=viewstat&stat=146
