Friday, December 16, 2005

Win Shares Walkthrough, pt. 2

Park Factors in Win Shares
Win Shares requires park adjustments at various points, so I will first go through and calculate the park factors for the 1993 Braves. The Win Shares Park Factor approach is fairly conventional. Five years of data are used when available, but the current year makes up 50% of the PF, with the other years splitting the remaining 50% equally. We first calculate RPG at home and away for each season of data we are using (this includes both runs scored and runs allowed). We then find the PF in two steps:
iPF = RPG(H)/RPG(R)
PF = ((T-1)*iPF+T-iPF)/(2*(T-1))
Where T is the number of teams in the league. In 1993, the Braves scored 366 and allowed 291 in 81 games at home, and scored 401 and allowed 268 in 81 games on the road. So RPG(H) = (366+291)/81 = 8.11 and RPG(R) = (401+268)/81 = 8.26. The iPF = 8.11/8.26 = .982. Then, since there are 14 teams in the league, PF = ((14-1)*.982+14-.982)/(2*(14-1)) = .992. Once we have found the PF for each season, we find the weighted average to get the PF(R), the Runs Park Factor. If more than one season of data is used, then the weight on the focus season is .5 and the other seasons split the remaining .5 equally. For instance, if 2 seasons are used, each is weighted at .5. If 5 are used, the focus year is .5 and the others are each .125. As best as I can figure, Bill uses only 1993, 1994, and 1995 for ATL in this case, so 1993 is weighted at 1/2 and 1994 and 1995 are each weighted at 1/4. Finding the other years' PFs and applying the formula, we get a PF(R) for the Braves of .998.
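The single-season calculation and the weighting scheme can be sketched in Python as follows. The function names are my own labels, not anything from the Win Shares book, and only the 1993 home/road totals come from the text above. The 1994 and 1995 factors are not listed here, so the weighted average is defined but not run on real Braves data.

```python
def season_pf(home_r, home_ra, home_g, road_r, road_ra, road_g, teams):
    """Single-season runs park factor, following the two steps in the text."""
    rpg_home = (home_r + home_ra) / home_g
    rpg_road = (road_r + road_ra) / road_g
    ipf = rpg_home / rpg_road  # initial park factor
    # Conversion from the initial factor to the final factor for T teams.
    return ((teams - 1) * ipf + teams - ipf) / (2 * (teams - 1))


def season_weights(n_seasons):
    """Weights with the focus season first, then the other seasons."""
    if n_seasons == 1:
        return [1.0]
    other = 0.5 / (n_seasons - 1)
    return [0.5] + [other] * (n_seasons - 1)


def weighted_pf(season_pfs):
    """season_pfs[0] is the focus season; the rest are the other seasons."""
    weights = season_weights(len(season_pfs))
    return sum(w * pf for w, pf in zip(weights, season_pfs))


pf_1993 = season_pf(366, 291, 81, 401, 268, 81, teams=14)
print(round(pf_1993, 3))   # 0.992
print(season_weights(3))   # [0.5, 0.25, 0.25]
print(season_weights(5))   # [0.5, 0.125, 0.125, 0.125, 0.125]
```

With the three single-season factors in hand, weighted_pf([pf_1993, pf_1994, pf_1995]) would give the PF(R).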

PF(HR) is found in the same way, substituting HR data for R data. We get a result of 1.019. Then James figures a third PF, the “Park-S Adjustment”, which is the effect of the park on everything other than home runs. The formula for this is:
PF(S) = sqrt((PF(R)-LHR%*PF(HR))/(1-LHR%))
Where LHR% = Lg(HR/R)*1.50
For the Braves, the league hit 1956 HR and scored 10190, so LHR% = 1956/10190*1.5 = .288. Then PF(S) = sqrt((.998-.288*1.019)/(1-.288)) = .995
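As a quick check on the arithmetic, here is a small Python sketch of the Park-S formula. The function and parameter names are assumptions of mine; the inputs are the rounded PF(R) and PF(HR) from the text plus the league totals.

```python
from math import sqrt


def park_s(pf_r, pf_hr, lg_hr, lg_r, hr_weight=1.5):
    """Park-S adjustment: park effect on everything other than home runs."""
    lhr_pct = lg_hr / lg_r * hr_weight  # share of league runs tied to HR
    return sqrt((pf_r - lhr_pct * pf_hr) / (1 - lhr_pct))


pf_s = park_s(pf_r=0.998, pf_hr=1.019, lg_hr=1956, lg_r=10190)
print(round(pf_s, 3))  # 0.995
```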

My Take: The Win Shares Park Factors are very solid. The conversion from the initial factor to the final factor is essentially equivalent to the one I use (which was published by Craig Wright). The one disagreement I have is with the weighting. I think a one-half weighting on the current year is too much, but James cites weather and other factors that may well change from year to year as a reason for this. Just a small difference of opinion.

The Park-S adjustment is actually quite clever, and I wish I had thought of it myself. The question that I have is why league data is used. The league average is irrelevant to the Braves' games--their stats and their opponents’ stats should be considered. Also, the linear weight value of 1.50 for a home run is obviously just a quick and reasonable value, but given the attempt to have precision in other areas of the Win Shares method, it seems very imprecise. So throughout this walkthrough I will just say “that is imprecise”, and that will mean that while it is a reasonable estimate and may work in almost all cases, it could be estimated more precisely, as so many things in this method are.

The PF(S) uses a square root because hits and other “run components” have a roughly square relationship with runs. For instance, Pete Palmer uses the square root of the runs PF to adjust OBA and SLG in PRO+. The reason for this is the basic, approximate Runs = OBA*SLG relationship discovered by Dick Cramer in his BRA method, and later used by Bill James to create basic Runs Created.

Dividing Win Shares Between Offense and Defense
The first important step in Win Shares is to divide credit for the team wins between the offense and the defense. This is done by the percentage of marginal runs that each provides.

First, we calculate expected runs and runs allowed:
ExpR = Lg(R/I)*PF(R)*IB
ExpRA = Lg(R/I)*PF(R)*IP
IB is innings batted, which is estimated as IB = IP - W(H) + L(R), where W(H) is wins at home and L(R) is losses on the road. In the 1993 NL, the average R/I is 10190/20284 = .5024. The Braves pitched 1455 innings and won 51 home games while losing 28 road games, giving them an IB of 1455 - 51 + 28 = 1432. So the Braves' expected R and RA are:
ExpR = .5024*.998*1432 = 718.16
ExpRA = .5024*.998*1455 = 729.70
Then the marginal runs are found:
MR = R - ExpR*.52
MRA = ExpRA*1.52 - RA
The Braves scored 767 runs and allowed 559, so MR = 767 - 718.16*.52 = 393.56, and MRA = 729.70*1.52 - 559 = 550.14.
Finally, we split the Win Shares (there are 3 win shares for each team win):
OWS = MR/(MR+MRA)*W*3
DWS = MRA/(MR+MRA)*W*3
So the percentage of total marginal runs coming from offense is defined as the percentage of wins coming from offense. The Braves’ offense accounts for 393.56/(393.56+550.14) = 41.7% of the team wins, and the defense the other 58.3%. Since the Braves won 104 games, there are 312 win shares to go around, 130 for the offense and 182 for the defense.
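The whole offense/defense split can be sketched in a few lines of Python. The function names are mine, and I feed in the rounded PF(R) of .998, so the intermediate expected-run figures come out slightly different from the ones quoted above (which appear to use an unrounded factor); the final split rounds to the same 130 and 182.

```python
def innings_batted(ip, home_wins, road_losses):
    """Estimate innings batted from innings pitched."""
    return ip - home_wins + road_losses


def split_win_shares(r, ra, ip, home_wins, road_losses, wins,
                     lg_r_per_inn, pf_r):
    """Return (offensive win shares, defensive win shares) for a team."""
    ib = innings_batted(ip, home_wins, road_losses)
    exp_r = lg_r_per_inn * pf_r * ib    # expected runs scored
    exp_ra = lg_r_per_inn * pf_r * ip   # expected runs allowed
    mr = r - 0.52 * exp_r               # marginal runs (offense)
    mra = 1.52 * exp_ra - ra            # marginal runs saved (defense)
    off_share = mr / (mr + mra)
    total_ws = 3 * wins                 # three win shares per team win
    return off_share * total_ws, (1 - off_share) * total_ws


ows, dws = split_win_shares(
    r=767, ra=559, ip=1455, home_wins=51, road_losses=28, wins=104,
    lg_r_per_inn=10190 / 20284, pf_r=0.998,
)
print(round(ows), round(dws))  # 130 182
```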

My Take: This is one of the crucial steps in the process, and it is a fairly clever one, certainly not something I would have thought up. I’m not sure it works, but it’s clever either way. What James does is calculate marginal runs against some very low baseline. He never explains exactly what this baseline is supposed to represent, other than that it works. He uses .52 for the offense and 1.52 for the defense. Let’s call the league average runs/game L. James says:
W% = (R-.52L + 1.52L-RA)/(2*L)
In other words, RPW = 2L, since we can rewrite this equation as:
W% = (R-RA)/(2L) + .5
It does not matter if you use .52 and 1.52, or .5 and 1.5, or .6 and 1.6--as long as there is a difference of 1 between the two baselines, the team W% formula will hold.
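To see the baselines cancel numerically, here is a short Python check. The team and league figures are made up for illustration, and james_wpct is just my name for the formula above.

```python
def james_wpct(r, ra, lg, off_base=0.52):
    """Implied team W% from marginal runs; r, ra and lg are per game."""
    def_base = off_base + 1       # the two baselines differ by exactly 1
    mr = r - off_base * lg
    mra = def_base * lg - ra
    return (mr + mra) / (2 * lg)


# Hypothetical team: 4.7 R/G and 3.5 RA/G in a 4.5 R/G league.
r, ra, lg = 4.7, 3.5, 4.5
for base in (0.5, 0.52, 0.6):
    print(base, round(james_wpct(r, ra, lg, base), 4))  # 0.6333 each time
```

Whatever baseline is chosen, the result matches (R-RA)/(2L) + .5, because the baseline terms sum to L.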

The team winning percentage formula is correct at .500, or for a team whose RPG (offense and defense) is equal to 2*L (if we assume that Runs per Win = RPG). The league context does not matter in determining the number of games a team will win. The runs/wins converter depends on the scoring context, but it depends on the scoring context of the team, not that of the league. (R-RA)/(R+RA) + .5 is a much better estimate of team W% than James' method. James must use his version so that he can compare to the league, but this will cause increasingly large distortions as teams move away from the average situation where the formula holds.
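A small Python comparison shows where the two estimators agree and where they part ways. The three teams are hypothetical, all with the same +1.0 run differential per game in a 4.5 R/G league; the function names are mine.

```python
def wpct_league_context(r, ra, lg):
    """W% implied by the marginal runs setup (RPW = 2L)."""
    return (r - ra) / (2 * lg) + 0.5


def wpct_team_context(r, ra):
    """W% using the team's own scoring context (RPW = R + RA)."""
    return (r - ra) / (r + ra) + 0.5


lg = 4.5
for r, ra in [(5.0, 4.0), (3.5, 2.5), (6.5, 5.5)]:
    print(r, ra,
          round(wpct_league_context(r, ra, lg), 3),
          round(wpct_team_context(r, ra), 3))
```

The first team has R + RA equal to 2L, so the two estimates coincide. For the low-scoring and high-scoring teams, the league-context formula keeps returning the same figure while the team-context formula moves with the team's own run environment.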

An average team will have R = RA = L, and therefore have MR = L-.52L and MRA = 1.52L-L. Plugging these MR and MRA into the win share splitting formula of MR/(MR+MRA) simplifies to .48. In other words, for a perfectly average team, 48% of wins will be attributed to offense and 52% to defense.

This seems a little off at first blush, as offense and defense are usually assumed to be symmetrical in sabermetrics. There is very little difference in run distribution patterns for offense and defense, which leads to a conclusion of a 50/50 split. James counters by pointing out that the number of runs a poor team could score is limited at zero, while there is no limit to the number of runs a poor team could allow. James also says that the very worst teams, historically, have had relatively worse defenses than offenses. Additionally, he claims that using .52/1.52 gives “better” results than using .5/1.5 (i.e. pitchers are not rated as lowly in comparison with batters as they would be otherwise).

I believe that part of the reason that pitchers rate poorly is because pitching is split between pitchers and fielders at the team level, where every pitcher actually relies on their defense to a different extent. I personally don’t feel that his evidence is strong enough to abandon the comfortable .5/1.5 split, but since I don’t have any strong evidence to counter with, I will leave that be. Instead we can look at the properties of this procedure. Suppose we have a league where the average is 5 runs per game and we have a team with an average defense (RA = 5) and a margin-level offense (R = .52*5 = 2.6). This team, according to the W% formula above, will have a (2.6-5+5)/(2*5) = .260 W%. Since there are no marginal runs on offense, all of the team's wins will be credited to the defense (.260).

Suppose we had instead a team which was perfectly average. They would have a .500 W%, and 52% of this would be attributed to the defense, or .260 wins. This illustrates the very useful property of this assumption--an average defense will always get the same number of wins credited to it, regardless of what the offense does.
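This property is easy to verify with a short Python sketch. It assumes, as the discussion here does, that the team's actual W% equals the estimate from the formula; the function name and the 7.0 R/G offense are my own additions for illustration.

```python
def credited_wins(r, ra, lg, off_base=0.52, def_base=1.52):
    """Split the implied W% between offense and defense by marginal runs."""
    mr = r - off_base * lg
    mra = def_base * lg - ra
    total = mr + mra
    wpct = total / (2 * lg)
    # Each side's credit simplifies to its marginal runs divided by 2L.
    return wpct * mr / total, wpct * mra / total


lg = 5.0
for r in (2.6, 5.0, 7.0):  # margin-level, average, and strong offenses
    off, dfn = credited_wins(r, ra=5.0, lg=lg)
    print(r, round(dfn, 3))  # defense credit is 0.26 in every case
```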

Of course, this breaks down if either component of your team is sub-marginal, as it will be assigned negative wins. This does not make any sense, but is necessary to keep the other component of your team at its required level. The problem is that some individuals on this team will be above the margin, and therefore will be assigned negative win shares for super-marginal performance. Of course, actual teams don’t exist at these extremes, but if a system collapses at .52, what is it doing at .7? There is some point at which the system starts to act screwy, and that point may be on the fringe of the real data section. Or it may not.

So the wins that will be assigned to either your offense or defense will be their marginal runs divided by 2 times the league average. In this and all of the above discussion, I am assuming that the team’s actual performance is equivalent to its W% estimation.
