Wednesday, November 30, 2005

The Fallacy of the Ecological Fallacy

From time to time, someone who has a background in formal statistics will claim that applying various measures tested at the team level (usually a run estimator) to individual players is falling prey to the Ecological Fallacy and is thus invalid.

Since I do not have a formal statistics background, it may be hazardous for me to talk about something that I don’t fully understand. But I can tell you that, to the extent that I understand the ecological fallacy, the idea that it applies to individual runs created estimates is hokum.

According to this link, the ecological fallacy occurs when “making an unsupported generalization from group data to individual behavior”. They then use an example of voting. One community has 25% who make over $100K a year, and 25% who vote Republican. Another has 75% who make over $100K and 75% who vote Republican. To use this data to conclude that there is a perfect correlation between individuals voting Republican and making over $100K would be the ecological fallacy. In fact, they show how the data could be distributed so that the correlation between individuals voting Republican and making over $100K is actually negative.
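
To make the voting example concrete, here is a minimal sketch of one way the individual-level correlation can come out negative even though the community-level percentages line up perfectly. The community sizes (300 and 100 people) and the counts inside each community are my own hypothetical numbers, not the ones from the linked example; only the 25% and 75% margins come from it.

```python
from math import sqrt

def phi(n11, n10, n01, n00):
    """Pearson correlation between two yes/no traits, from a 2x2 count table.

    n11 = both traits, n10 = first only, n01 = second only, n00 = neither.
    """
    row1, row0 = n11 + n10, n01 + n00
    col1, col0 = n11 + n01, n10 + n00
    return (n11 * n00 - n10 * n01) / sqrt(row1 * row0 * col1 * col0)

# Hypothetical counts, ordered as:
# (over $100K & Republican, over $100K only, Republican only, neither)
community_a = (0, 75, 75, 150)  # 300 people: 25% over $100K, 25% Republican
community_b = (50, 25, 25, 0)   # 100 people: 75% over $100K, 75% Republican

# Pool the individuals from both communities into one table.
pooled = tuple(a + b for a, b in zip(community_a, community_b))
individual_r = phi(*pooled)

print(pooled, round(individual_r, 4))
```

With these made-up counts the pooled individual correlation is slightly below zero, which is the pattern the fallacy warns about.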

People will then go on to claim that since Runs Created methods are tested on teams, it is wrong to apply them to individuals and assume accuracy. It is true that multiplicative methods like Runs Created and Base Runs make assumptions about how runs are created that are true when applied to teams but cannot be applied to individuals (the well-documented problem of driving yourself in; Barry Bonds’ high on base factor interacts with his high advancement factor in RC, but in reality interacts with the production of his teammates). It is also true that regression equations have many potential pitfalls when applied to teams, let alone taking team regressions and applying them to individuals. However, these limitations are well known by most sabermetricians (although some stubbornly continue to use James’ RC for individual hitters).

The ecological fallacy claim, though, is extended by some to every run estimator that is verified against team data. The claim is that there “need not be little to no connection between team-level functions and player-level functions”. I also saw a critic point out once that run estimators did not do a good job of predicting individual runs scored.

My retort was that the low temperature today in Mozambique did not do a good job of predicting individual runs scored either. To assume that the team runs scored function and the individual runs scored function are the same is to be ignorant of the facts of baseball. A walk and a single have an equal run-scoring value for an individual, and a home run will always have an individual run-scoring value of 1. This is not true for a team, because, except in the case of the home run, it takes another player to come along and drive his teammate in. In the team case, all of these individual stats are aggregated. The home run by one batter not only scores him, it scores any teammates on base. And therefore the act of scoring runs, for a team, incorporates advancement value as well. A single will create more runs, in average circumstances, than will a walk.

Therefore, when we have a formula that estimates runs scored for a team, it does not estimate the same function as runs scored for a player. It instead approximates another function that we choose to call “runs created” or “runs produced” or what have you. Now it could be claimed, I suppose, that the runs created function cannot be applied to individuals. But why not? If a double creates .8 runs for a team, and a hitter hits a double, why can’t we credit him with creating .8 of the team’s runs? All we are doing is assigning what we know are properly generated coefficients for the team to the player who actually delivered them. Or you can look at it, in the case of theoretical team RC, that we are isolating the player’s contribution by comparing team runs scored with him to team runs scored without him.

Furthermore, the individual runs created function and the team runs scored function are the same function. They have to be. Who causes the team to score runs, the tooth fairy? In the case of the voting situation which was said to be the ecological fallacy, you are artificially forming groups of people that don’t actually interact with each other. I can vote Republican, and you can vote Republican, but we’re not working together in that. You can vote Democrat and I can still vote Republican; our choices are independent. Then you make this group that voted Republican, and look at their income, and yes, you can reach misleading conclusions.

The point I’m trying to make is that voting is not a community-level function, and therefore it is wrong to attribute the community level data pattern to individuals. People vote as individuals, not as communities. But scoring runs is a team-level function. People create runs as teams, each contributing. If we use a different voting analogy, that of the electoral college, people cast electoral votes as states. And therefore we can break down how much of the electoral vote of Montana that each citizen was responsible for (one share of however many if they voted for the winning candidate, zero if they did not). And that’s what we are doing by looking at individual runs created.

I think the problem, and I don’t mean this to apply to all statisticians who dabble in sabermetrics, but to some, particularly those who don’t have a strong traditional sabermetric background to go along with their statistical knowledge, is that they tend to take all of the things they know can often happen in statistical practice and apply them to sabermetrics, without seeing whether the conditions are in place. In the same way, they will use statistical methods like regression when they are not necessary. If you are studying a phenomenon that you don’t have a good theory on, then regression can be a great tool. But if you are studying a baseball offense, you’re better off constructing a logical expression of the run scoring process like Base Runs or using the base/out table to construct Linear Weights. You don’t need a regression to ascertain the run values of events--baseball offenses are complex, but they are not nearly as complex as many of the other phenomena in the world.

Tuesday, November 29, 2005

Rate Stats, part 1

A while back I attempted to write an article for my website on rate stats for batters. I wrote about a page and got frustrated and quit. It was not going to be anything groundbreaking, just a summary of the existing work in the area mixed with some of my personal thoughts, just like most of the other articles on my site. I have decided to try again, but instead I will just write several blog segments and copy and paste when I'm done.

In sabermetrics, we usually express individual hitter contributions in terms of an estimated number of runs created, because the goal of a baseball offense is to score runs. This is extremely clear on the team level, and it logically follows that if the team's goal is to score runs, the player's goal should be to create runs for his team. There is a little caveat that complicates everything, however. Everything a team does offensively is captured in its statistics (I do not mean things like taking the extra base or hitting with runners in scoring position and other stuff that is not accounted for in the official statistics). When the team avoids an out by reaching base, and thereby creates another opportunity for itself, whatever production is created by the extra opportunity is included in its statistics. When an individual does this, he gets credit for reaching base, but he is not explicitly credited for creating an opportunity for a teammate (depending on what kind of metric we are using to evaluate him, which adds more complexity).

So for a team, it is all very simple. The goal is to score runs, specifically as many runs as possible within the constraints set by the rules of the game. In most sports, the constraint is time, but in baseball, it is outs. A team's goal is to maximize the number of runs it scores per out. Since an inning is 3 outs, you could also state that the goal is to maximize runs scored per inning. Since a game is 9 innings, it could be stated as maximizing runs scored per game (of course not all innings have 3 outs, particularly 3 outs that can be found in the official statistics, and not all games have 9 innings, but you get the point. In theory this is true and in practice it is close enough to true to not cause any problems). And since a season is 162 games, you can say the goal is to maximize runs scored for the entire season.

So really, for teams, you don't need a rate stat. The only reason we use R/G or R/Inning or R/Out for a team is because of the variance from the theoretical conversions. But if there were no variance from the theoretical conversions, runs scored would be the only stat you would need to know for a team's offense.

It is absolutely clear that Runs per Plate Appearance (R/PA) is an inappropriate measure of a team's offense. The number of plate appearances is a product of the team's performance. Higher on base averages lead to more plate appearances. If two teams score 800 runs, but one does it in 6200 PA and one does it in 6400 PA, the first team may have had an offense more slanted towards power, and the second towards getting on base. But they are equivalent in their impact on the game. Scoring one run in an inning on 3 outs and a home run is worth exactly the same as scoring one run in an inning on 3 outs and 4 walks. R/PA may have use as a descriptive stat for a team, but it is not a measure of their offensive productivity.
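
The two 800-run teams can be put through a quick calculation. This is only a sketch: it assumes both teams made exactly the theoretical number of outs for a season (162 games of 9 innings of 3 outs), and the function and variable names are mine.

```python
# Theoretical outs in a season: 162 games * 9 innings * 3 outs.
# This is an assumption; real teams deviate from it slightly.
SEASON_OUTS = 162 * 9 * 3

def rate_stats(runs, plate_appearances, outs=SEASON_OUTS):
    """Return (runs per PA, runs per out, runs per 27-out game)."""
    r_per_pa = runs / plate_appearances
    r_per_out = runs / outs
    return r_per_pa, r_per_out, r_per_out * 27

power_team = rate_stats(800, 6200)    # fewer PA, more power
on_base_team = rate_stats(800, 6400)  # more PA, more times on base

for name, stats in (("power", power_team), ("on-base", on_base_team)):
    print(name, [round(x, 4) for x in stats])
```

The R/PA figures differ between the two teams while R/O and R/G are identical, which is the sense in which R/PA describes the shape of an offense rather than its productivity.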

Next time: R/PA and R/O applied to individual batters

Monday, November 28, 2005

Awards

I had intended to post my complete ballots for the Internet Baseball Awards here but I seem to have misplaced the paper I wrote them down on, and they are lost to history. I could of course recreate them, and be pretty close to what I actually voted, but instead I will simply review the winners and who I would have picked.

My ROY picks were Ryan Howard and Huston Street. These were also the BBWAA and IBA winners.

My NL Cy Young choice was Roger Clemens, while my AL choice was Johan Santana (last year's BBWAA choices, although Johnson deserved the NL award). The BBWAA chose Carpenter and Colon, while the IBA chose Clemens and Santana as I did.

The choices of Carpenter and Colon by the BBWAA have to be frustrating to people who get exercised about this sort of stuff. They are a return to the "best W-L record automatically wins" days. Carpenter pitched 241 innings with a 3.12 RA, for +65 above replacement and +36 above average. Clemens pitched just 211 innings, but posted a remarkable 2.13 RA for +80/+54. I don't think it's particularly close. I also have Pettitte and Oswalt ahead of Carpenter, although not to the same extent Clemens is. And behind Carpenter, Dontrelle Willis and Pedro Martinez are very close as well.

Examining Carpenter and Clemens more closely, Carpenter got 5.51 runs of support (5.62 park adjusted). Clemens got 3.58 (3.51 park adjusted). The average NL pitcher, with a RA of 4.45, and an EW% slope of .107, would put up a .625 W% with Carpenter's support and just .399 with Clemens'. Therefore, Carpenter (21-5) was +4.75 wins in his 26 decisions and Clemens (13-8) was +4.62 in his 21. Pretty much a dead heat. And you expect about 1 decision for every 9 innings, which means Carpenter should have had 26.8 and Clemens should have had 23.4. So Clemens, in addition to getting lower run support, got fewer decisions, even when compared to innings.
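
The arithmetic in that comparison can be sketched as follows. I am assuming the expected W% is the simple linear form .500 + slope * (run support - league RA), since that form reproduces the .625 and .399 figures quoted; the function names are mine.

```python
LEAGUE_RA = 4.45   # average NL pitcher's RA, from the text
EW_SLOPE = 0.107   # EW% slope, from the text

def expected_w_pct(run_support, league_ra=LEAGUE_RA, slope=EW_SLOPE):
    """W% an average pitcher would post with this support (assumed linear form)."""
    return 0.5 + slope * (run_support - league_ra)

def wins_above_average(wins, losses, run_support):
    """Actual wins minus the wins an average pitcher gets in the same decisions."""
    decisions = wins + losses
    return wins - decisions * expected_w_pct(run_support)

def expected_decisions(innings):
    """About one decision per nine innings, as in the text."""
    return innings / 9

# Park-adjusted run support, records, and innings as given in the text.
carpenter = wins_above_average(21, 5, 5.62)
clemens = wins_above_average(13, 8, 3.51)

print(round(expected_w_pct(5.62), 3), round(expected_w_pct(3.51), 3))
print(round(carpenter, 2), round(clemens, 2))
print(round(expected_decisions(241), 1), round(expected_decisions(211), 1))
```

Carrying full precision through instead of rounding the W% first can move the wins-above-average figures by about a hundredth of a win, so small differences from the numbers in the text are just rounding.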

If you figure this comparison based on replacement level instead of average, Carpenter will do better still because he had more decisions, but the difference will be small and is not nearly enough in my opinion to justify overlooking Clemens' huge advantages in run prevention.

The AL Cy is even worse, since Santana was 4.9 wins above an average pitcher with his run support while Colon was just 2.2 better. Santana pitched 231 innings, Colon 222. Santana had a 2.97 RA; Colon 3.84. Santana was +77/+46; Colon +52/+23. No one was particularly close to Santana either; Halladay had better rate stats, but pitched 90 fewer innings due to his injury.

My MVPs went to Rodriguez and Lee; the IBA and BBWAA awards went to Rodriguez and Pujols. The choice of Pujols is not a bad one by any means; perhaps a close analysis of situation data could even make it seem obvious. What was puzzling, from the writers, was Andruw Jones finishing second and Lee third. Apparently the contending team pseudo-requirement was taken to its extreme. Interestingly, though, no NL player other than Pujols, Lee, and Jones received a third place or better vote. I don't see why, though, in the case of Jones. He had an OBA of .335 and made 435 outs while creating 112 runs. Contending players like Brian Giles and Morgan Ensberg had superior seasons offensively. Certainly Jones' defense is valuable, but to get up with Derrek Lee he's got thirty runs to make up.

Lee and Pujols were very close in many categories. Lee batted 679 times, Pujols 688. Lee made 398 outs, as did Pujols. Lee hit .335, Pujols hit .333. Lee had a .418 OBA, Pujols .428. It was in power that Lee staked his claim with a .662 SLG and .327 ISO versus .615 and .282 for Pujols. Lee created 150 runs, Pujols 146. Lee was 90 runs better than a replacement level first baseman, Pujols 86. Lee was 68 runs better than an average first baseman, Pujols 64.

One also cannot shake the feeling that it may have been a lifetime achievement award for Pujols. He has finished behind Bonds the last couple years with some tremendous seasons, and this year, with a much better claim to being the best in the league, it was tough not to give it to him.

Saturday, November 26, 2005

BJ Ryan and JP Ricciardi

Apparently the Blue Jays are going to give BJ Ryan five years and forty-seven million. Absolutely insane. With the recent firing of Paul DePodesta and the resignation of Theo Epstein, the “sabermetric” GMs have been reduced to Billy Beane in Oakland and JP Ricciardi in Toronto. Ricciardi was never considered as much of a true believer as the others, but since he worked for Beane, he has gotten that label. With this signing, it is clear that something has happened, or that he was not a sabermetric GM to begin with.

Ryan pitched 50 innings in 2003, with a 3.85 Relief RA, a 3.74 eRA, and a 3.52 G-F, good for +6 RAA. In 2004, in 87 innings he had a 2.31 RRA, a 2.96 eRA, and a 3.31 G-F for +26 RAA. Last year, at age 30, he worked 70 innings with a 2.49 RRA, a 2.93 eRA, and a 3.29 G-F for +18 runs. So he is clearly an outstanding reliever. He led all AL relievers in G-F, and was tenth in eRA and second in GRA. Along with Mariano Rivera, he is the class of AL relievers. But to invest almost ten million a year in him, for a team that supposedly is not flush with money, seems awfully foolish. Particularly from someone (presumably) with a sabermetric perspective and a recognition that the proven closer label is silly, that what you want is a quality pitcher period.

Anyway, if you assume that a “sabermetric GM” would not sign this contract, what are the possible explanations:

1. Ricciardi never was a sabermetric GM

2. Ricciardi was a sabermetric GM, but has learned through experience that the approach (at least with respect to closers) will not work

3. Ricciardi is still sabermetrically-minded, but outside influences, be they direct orders, fear of being fired, or media pressure, have caused him to do something against his better judgement

4. Ricciardi wants to marry B.J. Ryan’s sister