Monday, November 19, 2007

Tangent Lines and Bill Kross

This is a math post with little baseball content and no baseball insight, so be forewarned.

In calculus, at least as far as I understand it, the tangent line at a point on a curve is the line that passes through that point in the same direction as the curve; in other words, the line has the same slope that the curve has at that point. That’s the best I can do--see this Wikipedia article for a better description.

Anyway, the tangent line is linear (it can be written as y = mx + b), and it has the same slope as the curve at the point where they meet. That means that near the point in question, it is just about the best linear approximation that you can get.

Where this ties into baseball is that if we have a non-linear function and want a linear approximation to it, the tangent line can be a shortcut that is easier and quicker than generating a line through some other technique (such as regression). Understanding how the tangent line works can also help us understand why non-linear baseball models have the linear approximations that they do.

First, let’s calculate a tangent line for a non-baseball problem. Suppose we have the function z = x^3, and we want a tangent line at the point x = 3. At x = 3, z = 3^3 = 27. The slope at x = 3 can be found by first taking the derivative of z, which is z’ = 3x^2, so z’(3) = 3(3)^2 = 27.

We can write the line in the point-slope format as y - y1 = m(x - x1), where y1 and x1 are the base (x,y) point and m is the slope. So y - 27 = 27(x - 3). We can convert this to the common y = mx + b form to get y = 27x - 54.

At x = 3, y = 27(3) - 54 = 27, which is exactly equal to z, as we know it should be. If we look at another x value close to 3, say 3.1, we get z = 29.791 and y = 29.7. As you can see, they are pretty close. As we get further away, the linear approximation will perform worse, especially for functions whose slope changes quickly.
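For anyone who wants to see how quickly the approximation degrades, here is a minimal Python sketch of the example above. The function names are just labels chosen for this illustration.

```python
def curve(x):
    """The example function z = x^3."""
    return x ** 3


def slope(x):
    """Derivative of x^3, i.e. z' = 3x^2."""
    return 3 * x ** 2


def tangent_line(x0):
    """Return (m, b) so that y = m*x + b is the tangent line at x0."""
    m = slope(x0)
    b = curve(x0) - m * x0
    return m, b


m, b = tangent_line(3)  # slope 27, intercept -54

# Compare the curve and its tangent line as x moves away from 3.
for x in (3.0, 3.1, 3.5, 4.0, 5.0):
    z = curve(x)
    y = m * x + b
    print(f"x = {x:.1f}  z = {z:8.3f}  y = {y:8.3f}  error = {z - y:7.3f}")
```

The error is zero at x = 3 and grows as x moves away from the base point.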

Now, let’s talk about some of the baseball relationships where this is applicable. Clay Davenport used to publish a team version of EQR in which (RAW/LgRAW)^2 approximated the percentage to which the team R/PA exceeded the league average. There is also a linear version (which is the only one I have seen Clay publish in some time), in which the mapping is 2*(RAW/LgRAW) - 1.

Let’s call RAW/LgRAW “ARAW” for adjusted RAW. The two relationships we have are ARAW^2 and 2*ARAW - 1. Now suppose we work with the squared function and find the tangent line at the league average point, where ARAW = 1 and the result of the formula = 1 (this is common sense, as a team with a RAW equal to the league average should score runs at a rate equal to the league average). The slope of ARAW^2 is 2*ARAW, which is 2*1 = 2 when ARAW = 1. So y - 1 = 2*(ARAW - 1), and y = 2*ARAW - 1. As you can see, that is the other Davenport relationship.

This is no surprise, as even if Davenport derived the relationship through a regression approach, we would expect the best fit to be about the same as the point at the league average, since most of the teams are tightly clustered around that point.
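A rough way to check that expectation is to fit a least-squares line to ARAW^2 over values clustered near 1. The sketch below uses made-up ARAW values spaced evenly from .90 to 1.10, not real team data, and it says nothing about how Davenport actually derived his formula.

```python
# Hypothetical ARAW values clustered around the league average of 1.00.
araw = [0.90 + 0.01 * i for i in range(21)]
squared = [a ** 2 for a in araw]

n = len(araw)
mean_x = sum(araw) / n
mean_y = sum(squared) / n

# Ordinary least squares for: squared = slope * araw + intercept.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(araw, squared))
sxx = sum((x - mean_x) ** 2 for x in araw)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

print(f"regression line: {slope:.3f} * ARAW {intercept:+.3f}")
print("tangent line:    2.000 * ARAW -1.000")
```

With points this tightly bunched around 1, the fitted line comes out very close to 2*ARAW - 1.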

Another stat which follows the same relationship to runs is OPS. David Smyth (and perhaps others, but I recall seeing David write it) has pointed out that the square of relative OPS (not OPS+, but straight OPS/LgOPS) tracks runs, and Steve Mann wrote about the similar 2*(OPS/LgOPS) - 1 relationship eighteen years ago in The Baseball Superstats 1989.

The most interesting relationship, though, is the Pythagorean win estimator. I have written about this before on my website. Pyth can be written as:

WR = RR^z

Where WR is the win ratio (W/L), RR is the run ratio (R/RA), and z is the exponent (usually seen as z = 2). We know that for an average team, RR = WR = 1. The slope of the function is z*RR^(z - 1). If z = 2, then it is just 2*RR, which is 2 when RR = 1. If z = 1.83 (another common value), then it would be 1.83*RR^.83, which is 1.83 when RR = 1.

Using z = 2, the tangent line at RR = 1 is WR - 1 = 2*(RR - 1), or WR = 2*RR - 1. We know that W% = WR/(WR + 1), so we can write this as a W% estimator: W% = (2*RR - 1)/(2*RR).

This method of estimating W% was discussed, informally, by Bill James in the 1984 Baseball Abstract. James said that if a team scored 10% more runs than their opponents, they should win 20% more games. He wrote that he had never tried it but it “should work”, and dubbed it “Double the Edge”. I have no idea whether Bill came up with this through similar mathematical logic to what you see here, or whether it was intuitive. With James, I’d believe either.

Anyway, the good thing about this estimator is that it caps W% at 1. However, it does not bottom out at zero--a RR of less than .5 results in a negative W%.
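To see both the cap at 1 and the negative results, here is a small Python sketch comparing Pythagorean (with z = 2) to the linearized estimator over a range of run ratios. The function names are labels for this illustration only.

```python
def pythagorean_wpct(rr, z=2.0):
    """W% from WR = RR^z and W% = WR / (WR + 1)."""
    wr = rr ** z
    return wr / (wr + 1)


def double_the_edge_wpct(rr):
    """W% from the tangent line WR = 2*RR - 1, i.e. (2*RR - 1) / (2*RR)."""
    return (2 * rr - 1) / (2 * rr)


for rr in (0.4, 0.5, 0.8, 0.9, 1.0, 1.1, 1.25, 2.0, 10.0):
    pyth = pythagorean_wpct(rr)
    dte = double_the_edge_wpct(rr)
    print(f"RR = {rr:5.2f}  Pyth = {pyth:.3f}  DTE = {dte:+.3f}")
```

Near RR = 1 the two agree closely. At RR = .5 the linear version hits zero, and below that it goes negative, while very large run ratios approach but never reach 1.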

Ralph Caola, who has done a lot of work on run to win converters, emailed me after reading the article on my site and suggested that to solve this problem, one could use two equations: one when Run Ratio is greater than one, and one when Run Ratio is less than one. For the less than case, you could define W% as 1 - (2*OppRR - 1)/(2*OppRR), where OppRR is the opponents’ run ratio, RA/R. This way, reciprocal run ratios would produce complementary W%s, as we would intuitively expect (and as Pythagorean gives).

There are dozens of ways you can write those formulas, and Ralph settled on W% = (R-RA)/(R + RA + ABS(R-RA)) + .5.
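As a quick check that the single formula matches the two-equation version, here is a Python sketch. The run totals are hypothetical and the function names are labels for this illustration.

```python
def dte_two_equations(r, ra):
    """Two-equation form: one branch for RR >= 1, one for RR < 1."""
    if r >= ra:
        rr = r / ra
        return (2 * rr - 1) / (2 * rr)
    opp_rr = ra / r  # opponents' run ratio
    return 1 - (2 * opp_rr - 1) / (2 * opp_rr)


def caola_wpct(r, ra):
    """Single-formula form: (R - RA) / (R + RA + |R - RA|) + .5."""
    return (r - ra) / (r + ra + abs(r - ra)) + 0.5


# Hypothetical (runs scored, runs allowed) pairs.
for r, ra in ((800, 700), (700, 800), (900, 600), (600, 900), (750, 750)):
    two = dte_two_equations(r, ra)
    one = caola_wpct(r, ra)
    print(f"R = {r}  RA = {ra}  two-equation = {two:.4f}  single = {one:.4f}")

# Reciprocal run ratios should give complementary W%s.
print(caola_wpct(800, 700) + caola_wpct(700, 800))
```

Both forms give the same W% for every pair, and swapping R and RA turns a W% of x into 1 - x.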

And sure enough, the equation is more accurate and more theoretically sound if you use Caola’s insight. However, I have recently realized that Ralph was not the first one to uncover this formula. In fact, it has been in the public eye for over twenty years and little has been said about it. (I am not necessarily bemoaning this, because the only reason to use the linear approximations to Pythagorean is simplicity. They are not preferable. However, with the increased presence of sabermetric research all over the place, I am a bit surprised that Ralph and I seem to have been the only ones to play around with James’ Double the Edge.)

In The Hidden Game of Baseball, there is a brief description of several run to win methods in Chapter 4. In a footnote, Palmer/Thorn write “About a year after Pete’s article [in SABR’s The National Pastime] appeared, Bill Kross, a Purdue professor, devised an elegant little formula that was not only simpler than the others, but also very nearly as accurate, erring only when run differentials were extreme (+/- 200 runs). If a team is outscored by its opponents, Kross predicts its winning percentage by dividing runs scored by two times runs allowed; if a team outscores its opponents, the formula becomes, 1 - RA/(2*R).”

Remember what I said about there being dozens of different ways to write the DTE formula? I am not going to go through the algebra here, but suffice it to say that the Kross formulas are one of the dozens. I don’t know if Mr. Kross developed those by linearizing the Pythagorean formula, or through some other technique, but there it is. These formulas are not a breakthrough in accuracy, be it empirical or theoretical, but they are quick and easy and do have a strong logical foundation, and can even be seen as offshoots of Pythagorean estimators.
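In place of the algebra, a numerical check works too. The Python sketch below confirms that the Kross formulas give the same answers as (R-RA)/(R + RA + ABS(R-RA)) + .5, and shows how far they drift from Pythagorean (z = 2) as the run differential grows. The 750 runs allowed figure is a hypothetical team, and the function names are labels for this illustration.

```python
def kross_wpct(r, ra):
    """Kross: R/(2*RA) if outscored, otherwise 1 - RA/(2*R)."""
    if r < ra:
        return r / (2 * ra)
    return 1 - ra / (2 * r)


def single_formula_wpct(r, ra):
    """(R - RA) / (R + RA + |R - RA|) + .5, for comparison."""
    return (r - ra) / (r + ra + abs(r - ra)) + 0.5


def pythagorean_wpct(r, ra, z=2.0):
    """R^z / (R^z + RA^z)."""
    return r ** z / (r ** z + ra ** z)


ra = 750  # hypothetical runs allowed
for diff in (-300, -200, -100, 0, 100, 200, 300):
    r = ra + diff
    kross = kross_wpct(r, ra)
    pyth = pythagorean_wpct(r, ra)
    same = abs(kross - single_formula_wpct(r, ra)) < 1e-12
    print(f"diff = {diff:+4d}  Kross = {kross:.3f}  Pyth = {pyth:.3f}  "
          f"gap = {kross - pyth:+.3f}  matches single formula: {same}")
```

The gap from Pythagorean is small at modest run differentials and widens at the extremes, which is in line with the Palmer/Thorn footnote.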

Monday, November 12, 2007

Leadoff Hitters, 2007

For the last two years I have written a piece giving the leading and trailing teams in various categories that can be used to evaluate leadoff performance. I always try to stress that, as numerous studies have shown, batting order construction is not as crucial as conventional wisdom holds it to be. I am personally much more concerned about how a player performs in an average situation than in any particular lineup slot.

Nonetheless, the matter of who will lead off for a team is certainly one that is oft-discussed and is given particular attention by the men who run major league teams. Thus, it is useful to actually know which teams got good production out of the leadoff spot and which did not.

Before I start going into the various categories, let me first emphasize that the data is for each team’s aggregate leadoff performance. In parentheses after each team on a list, I will give the names of the individuals who appeared in at least 20 games in the leadoff spot, but unless the player took every plate appearance of the team’s season in the #1 slot, the statistics are not solely his. Also, the 20-game threshold does not mean 20 starts at leadoff hitter--it is 20 appearances, regardless of whether some of those came as a pinch hitter, pinch runner, defensive replacement, or what have you.

With the disclaimers out of the way, the most basic job of a leadoff hitter is to score runs. So runs scored per 25.5 outs (outs here are AB-H+CS) seems to be a good place to start:
1. PHI (Rollins), 7.2
2. MIL (Weeks/Hart), 7.2
3. DET (Granderson), 6.8
Leadoff Average, 5.6
ML Average, 4.8
28. CHA (Owens/Erstad/Podsednik), 4.5
29. STL (Eckstein/Taguchi/Miles), 4.5
30. WAS (Lopez/Logan), 4.1

Leadoff Average is the average for the team’s leadoff performances, while ML Average is the average for the league as a whole, slots one through nine. This is a sabermetric blog--I don’t need to point out to you the biases that exist in using actual runs scored data, so I will let those figures stand without comment.
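For reference, the per-25.5-outs rate used in the list above can be computed as in this minimal Python sketch. The stat line is hypothetical, not any team's actual 2007 totals.

```python
OUTS_PER_GAME = 25.5


def outs(ab, h, cs):
    """Outs as defined in the post: AB - H + CS."""
    return ab - h + cs


def per_25_5_outs(value, ab, h, cs):
    """Scale a counting stat (such as runs) to a per-25.5-outs rate."""
    return value * OUTS_PER_GAME / outs(ab, h, cs)


# Hypothetical leadoff-slot totals for one team.
ab, h, cs, runs = 700, 200, 10, 120
print(f"outs = {outs(ab, h, cs)}")
print(f"runs per 25.5 outs = {per_25_5_outs(runs, ab, h, cs):.1f}")
```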

Perhaps even more elemental to the traditional role of the leadoff hitter than scoring runs is getting on base. On Base Average is as important a statistic as there is anyway, so it’s only natural to look at how the leadoff men did:
1. SEA (Suzuki), .389
2. LAA (Willits/Figgins/Matthews), .377
3. FLA (Ramirez/Amezaga), .376
Leadoff Average, .341
ML Average, .332
28. ARI (Young/Byrnes/Drew), .309
29. HOU (Biggio/Burke), .305
30. WAS (Lopez/Logan), .305

To me, the Angels’ high showing is a bit of a surprise, as Chone Figgins and Gary Matthews have never been huge OBA guys, and Reggie Willits was a relative unknown. On the flip side, seeing Craig Biggio and company in a virtual tie for last in baseball is somewhat sad.

A slightly modified version of OBA that is worth looking at is what I call the Runners On Base Average. ROBA removes home runs and caught stealings from the OBA numerator, leaving only those times in which a runner was actually on base to be advanced by his teammates. However, in this stat the home run is treated no differently than an out, so it is to some extent a “style” stat and not a quality stat. That is not to say that ROBA is not a practical thing to know--it is after all just the Base Runs A factor per PA. Just keep in mind that it is a statistic in which higher is usually, but not always, better:
1. SEA (Suzuki), .370
2. BAL (Roberts), .352
3. LAA (Willits/Figgins/Matthews), .351
Leadoff Average, .309
ML Average, .300
28. HOU (Biggio/Burke), .278
29. TOR (Rios/Johnson/Wells), .277
30. ARI (Young/Byrnes/Drew), .260

Not surprisingly, four of the extreme teams are holdovers from the OBA list.
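Here is a minimal Python sketch of how ROBA relates to OBA. The post does not spell out its exact OBA formula, so the simple (H + W)/(AB + W) version below, which ignores hit batters and sacrifice flies, is an assumption of this sketch, and the stat line is hypothetical.

```python
def oba(ab, h, w):
    """Simplified On Base Average: (H + W) / (AB + W). Assumed form."""
    return (h + w) / (ab + w)


def roba(ab, h, w, hr, cs):
    """Runners On Base Average: OBA with HR and CS removed from the numerator."""
    return (h + w - hr - cs) / (ab + w)


# Hypothetical leadoff-slot totals for one team.
ab, h, w, hr, cs = 650, 190, 60, 15, 8
print(f"OBA  = {oba(ab, h, w):.3f}")
print(f"ROBA = {roba(ab, h, w, hr, cs):.3f}")
```

Because the home runs come straight out of the numerator, a leadoff slot with more power will show a bigger drop from OBA to ROBA, which is why this is partly a style stat.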

Moving further down the path of style stats is Bill James’ Run Element Ratio, which divides walks and steals by extra bases. The idea behind RER was that it was the ratio of those events that are most important early in an inning (table-setting events with little advancement value like the walk) against those that are most important late in an inning, when runners are already on base (power). Singles are ignored because they serve both purposes well.

RER is not really a statement of quality at all, but a statement of shape. In theory, players with high RERs would seem to be better suited as leadoff hitters than those with low RERs, but it doesn’t necessarily mean that they are actually more productive in the role. I believe RER is most useful when discussing leadoff hitters as a tool to pick out players who don’t fit the conventional wisdom of what a leadoff hitter should be, but who were utilized as such:
1. MIN (Castillo/Casilla/Tyner/Bartlett), 2.3
2. LAA (Willits/Figgins/Matthews), 2.2
3. CHA (Owens/Erstad/Podsednik), 2.1
Leadoff Average, 1.0
ML Average, .7
28. HOU (Biggio/Burke), .5
29. DET (Granderson), .5
30. CHN (Soriano/Theriot), .4

As you can see, we have teams show up in the leaders who have previously been among the trailers in “effectiveness” categories, leaders who were previously leaders, and all other such combinations.
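As a minimal Python sketch of RER, taking "extra bases" to mean total bases minus hits: the two stat lines below are hypothetical, made up to contrast a table-setter with a slugger.

```python
def run_element_ratio(w, sb, tb, h):
    """RER = (walks + steals) / extra bases, with extra bases = TB - H."""
    return (w + sb) / (tb - h)


# Hypothetical stat lines with the same number of hits.
table_setter = run_element_ratio(w=70, sb=40, tb=250, h=190)
slugger = run_element_ratio(w=40, sb=10, tb=340, h=190)

print(f"table-setter RER = {table_setter:.2f}")
print(f"slugger RER      = {slugger:.2f}")
```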

Going back to context neutral effectiveness metrics, another Bill James invention was an estimated runs scored figure, based on assumptions about how often a leadoff hitter scored from each base (James used 35% from first, 55% from second, and 80% from third). I call this Leadoff Efficiency when viewed per 25.5 outs:
1. FLA (Ramirez/Amezaga), 8.1
2. MIL (Weeks/Hart), 7.6
3. BAL (Roberts), 7.6
Leadoff Average, 6.2
ML Average, 5.8
28. WAS (Lopez/Logan), 5.1
29. HOU (Biggio/Burke), 5.1
30. STL (Eckstein/Taguchi/Miles), 4.9
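
A rough Python sketch of the calculation follows. The 35%, 55%, and 80% figures are the ones quoted above. How a hitter is counted as ending up on each base (singles and walks less steal attempts on first, doubles plus steals on second, triples on third, home runs as automatic runs) is an assumption of this sketch and may not match the exact bookkeeping behind the list. The stat line is hypothetical.

```python
OUTS_PER_GAME = 25.5


def estimated_runs(h, d, t, hr, w, sb, cs):
    """Estimated runs scored from assumed base states (see lead-in)."""
    singles = h - d - t - hr
    on_first = singles + w - sb - cs   # assumed: steal attempts leave first
    on_second = d + sb                 # assumed: steals end up on second
    on_third = t
    return 0.35 * on_first + 0.55 * on_second + 0.80 * on_third + hr


def leadoff_efficiency(ab, h, d, t, hr, w, sb, cs):
    """Estimated runs per 25.5 outs, with outs = AB - H + CS."""
    outs = ab - h + cs
    return estimated_runs(h, d, t, hr, w, sb, cs) * OUTS_PER_GAME / outs


# Hypothetical leadoff-slot totals for one team.
line = dict(ab=700, h=200, d=40, t=5, hr=15, w=60, sb=30, cs=10)
print(f"Leadoff Efficiency = {leadoff_efficiency(**line):.2f}")
```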

Of course, we can always just look at leadoff hitters the same way we would any other player, with a standard, context neutral run estimator. Using ERP as the estimator, here is good old Runs Created per Game:
1. FLA (Ramirez/Amezaga), 7.0
2. MIL (Weeks/Hart), 6.5
3. DET (Granderson), 6.5
Leadoff Average, 5.0
ML Average, 4.9
28. WAS (Lopez/Logan), 3.8
29. STL (Eckstein/Taguchi/Miles), 3.8
30. CHA (Owens/Erstad/Podsednik), 3.6

Finally, as David Smyth suggested for the first incarnation of this piece, we can look at a modified OPS with a weight of 2 for OBA. The most accurate weight for OBA is somewhere in the neighborhood of 1.7, so using 2 is closer to optimal than using 1, but serves to give a little extra boost to OBA, which may be justified when looking at leadoff hitters. The list presented below is actually (2*OBA + SLG)*.7, as the .7 multiplier makes it approximately equal to traditional OPS on the league level. Since we are dealing with meaningless units anyway, we might as well scale them to a meaningless scale with more familiarity (OPS):
1. FLA (Ramirez/Amezaga), 889
2. CHN (Soriano/Theriot), 855
3. DET (Granderson), 851
Leadoff Average, 767
ML Average, 761
28. STL (Eckstein/Taguchi/Miles), 685
29. WAS (Lopez/Logan), 679
30. CHA (Owens/Erstad/Podsednik), 673
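
The scaled figure in the list above can be reproduced with this minimal Python sketch. The OBA and SLG inputs are hypothetical, and multiplying by 1000 is only there to mimic the three-digit display used in the list.

```python
def modified_ops(oba, slg, oba_weight=2.0, scale=0.7):
    """(2*OBA + SLG) * .7, scaled to look roughly like traditional OPS."""
    return (oba_weight * oba + slg) * scale


# Hypothetical leadoff-slot rates.
oba, slg = 0.341, 0.400
value = modified_ops(oba, slg)
print(f"modified OPS = {value:.3f} (displayed as {round(value * 1000)})")
print(f"traditional OPS = {oba + slg:.3f}")
```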

If you are interested in looking at this stuff on your own, I have posted a Google spreadsheet with all of the data.