Sunday, June 30, 2013

Great Moments in Yahoo! Box Scores

Screengrab from 9:11 AM, a good 16 or so hours after this game ended.

Wednesday, June 26, 2013

Offensive Percentages and Overall Productivity

This one is really dated, so I’ll just point out that it was written in 2010.

Joe Mauer had the biggest power season of his career in 2009, and, not surprisingly, it was also his best overall offensive campaign. Still, Mauer was a very productive batter in 2006 and 2008 with much less power. How unusual was that? How good a hitter would we expect someone with his relative H/W/P profile to be? Those are the kinds of questions that this discussion touches upon.

I was inspired to look into this by a Twitter conversation I had in mid-late June (2010, mind you). I had read a comment on BTF about how the Twins must be terrified by Mauer's power drop this year given the huge investment they made. This led me to tweet that Mauer had been arguably the best position player in the league in '06 and '08 while hitting for relatively little power, and that his value was not necessarily dependent on power. In addition to the power drop in '10, he'd seen his walk and singles rates decline, and that was sapping his value as much as losing power.

Someone responded by saying that 2009 was the first season in which Mauer's power relative to the league was equal to his value relative to the league. I responded that his ISO ratio was much lower than his wRC ratio, and this led to a tangent about the slope of ISO versus runs.

However, the question that the conversation brought to mind was the typical relationship between overall offensive value and the share of that value that is derived from hits, walks, and power.

I'm going to look at all players 2000-2009 with 400 or more at bats in a season, and compare their H/W/P to their RG. Then I'm going to run some cringe-worthy regressions (but be comforted by the borderline freak-show nature of the topic itself), and then we can all find something more productive to do.

The strongest correlation between RG and any of H/W/P belongs to H%, which has an r of -.64 (P% is +.54 and W% is +.34). H% has a negative correlation with RG; the higher the proportion of positive linear weight contribution (I'm going to stop using that mouthful and start calling it "value", but please remember what I really mean is positive linear weight contribution) that comes from hits, the lower the RG.

The best way I found to estimate H/W/P from RG is to start by simply estimating H%. The best correlation for a simple regression comes from using the natural log of RG:

eH% = -.1883*ln(RG) + .9242

where RG = (TB + .8H + W - .3AB)*.324*25.2/(AB - H)
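As a minimal sketch, the two formulas above translate directly into Python. The function names and the stat line used at the bottom are hypothetical, chosen only for illustration; they are not from the original data set.

```python
import math

def runs_per_game(ab, h, tb, w):
    """RG as defined above: linear-weight runs per 25.2 outs (outs = AB - H)."""
    runs = (tb + 0.8 * h + w - 0.3 * ab) * 0.324
    return runs * 25.2 / (ab - h)

def est_hit_share(rg):
    """eH% as a fraction (0.60 means 60%), from the log regression above."""
    return -0.1883 * math.log(rg) + 0.9242

# Hypothetical stat line, not a real player: 550 AB, 165 H, 270 TB, 60 W
rg = runs_per_game(550, 165, 270, 60)
print(round(rg, 2), round(est_hit_share(rg), 3))
```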

I'm a little hesitant to even mess with logs in such a trivial application, but it gives a slightly better fit and it does a better job of matching the high RG outliers (read: Barry Bonds). Fretting about those outlier Bonds seasons may be problematic from a statistical perspective, but I think it has some grounding in baseball logic. It makes intuitive sense that H% will be lower as RG increases; the upper bound of observed seasonal BA is around .420. A .420 hitter with little power (.08 ISO for a .500 SLG) and moderate walks (.475 OBA, which means .1 W/AB in this case) will only have a 9 RG. In order to be a historic-level performer, one has to excel in both batting average and secondary average. The log regression seems to strike a balance between the two.
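The .420-hitter example can be checked by rewriting the RG formula in per-at-bat rates. This is only a sketch: the helper names are made up, and OBA here counts nothing but hits and walks, with plate appearances taken as AB + W.

```python
def rg_from_rates(ba, iso, w_per_ab):
    """RG from per-at-bat rates: the same formula as above, divided through by AB."""
    slg = ba + iso
    runs_per_ab = (slg + 0.8 * ba + w_per_ab - 0.3) * 0.324
    outs_per_ab = 1 - ba
    return runs_per_ab * 25.2 / outs_per_ab

def oba_from_rates(ba, w_per_ab):
    """OBA counting only hits and walks, with PA = AB + W (a simplification)."""
    return (ba + w_per_ab) / (1 + w_per_ab)

# The hitter described above: .420 BA, .080 ISO, .1 walks per at bat
print(round(rg_from_rates(0.420, 0.080, 0.1), 2))
print(round(oba_from_rates(0.420, 0.1), 3))
```

Under these assumptions the RG comes out a little under 9 and the OBA a little under .475, in line with the figures quoted in the text.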

After H% is removed, it's hard to find much of a correlation between RG and P%/W%. I figured the percentage of non-hit value contributed by power (P%/(P% + W%)), and its r with RG is just +.06. So I decided to keep it simple and simply use the average for everyone: 63% of non-hit value comes from P%, 37% from W%:

eW% = (1 - eH%)*.37
eP% = (1 - eH%)*.63
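Putting the three estimators together, a rough sketch of the full estimated H/W/P profile for a given RG might look like this. The function name is hypothetical, and shares are returned as fractions rather than percentages.

```python
import math

def estimate_hwp(rg):
    """Estimated (eH%, eW%, eP%) as fractions, using the regressions above."""
    e_h = -0.1883 * math.log(rg) + 0.9242
    non_hit = 1 - e_h
    e_w = non_hit * 0.37  # average share of non-hit value from walks
    e_p = non_hit * 0.63  # average share of non-hit value from power
    return e_h, e_w, e_p

# Example: a hitter with an RG of 5
e_h, e_w, e_p = estimate_hwp(5.0)
print(round(e_h, 3), round(e_w, 3), round(e_p, 3))
```

Because eW% and eP% split whatever eH% leaves over, the three estimates always sum to 1.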

These estimators work pretty well for players when grouped by RG. In the chart below, "2" indicates players with RG from 2 to 2.99; "4" for 4-4.49; "4.5" for 4.5-4.99; "7" for 7-7.99, and so on:

Really, I could have just dispensed with the estimators and just used the chart to estimate H/W/P for players of different ability levels, but where would be the fun in that?

Here is Joe Mauer's actual and estimated H/W/P breakdown for 2005-2009 and the first half of 2010 (which is current, as of the moment I actually wrote this):

To this point, Mauer's 2010 has been just about his worst offensive season (without an adjustment for league scoring context). Mauer has always had a higher H% than the average player with his RG. Even with his 2009 power surge, he had a lower P% than expected (24 to 31%).

Mauer's career high P% is 24%. That is the typical value for a player with a RG of 4.8-5.3. So even in Mauer's best power season, his P% is below a typical P% for a player with a RG lower than that in Mauer's worst overall season (yes, that is an awful sentence).
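One way to sanity-check the 4.8-5.3 range is to run the estimators backwards: given a P%, solve eP% = (1 - eH%)*.63 for eH%, then invert the log regression for RG. This is a sketch that assumes the regression formulas can simply be inverted; the function name is made up.

```python
import math

def rg_for_power_share(p_share):
    """RG at which the estimators above predict the given P% (as a fraction)."""
    e_h = 1 - p_share / 0.63                  # invert eP% = (1 - eH%) * .63
    return math.exp((0.9242 - e_h) / 0.1883)  # invert eH% = -.1883*ln(RG) + .9242

print(round(rg_for_power_share(0.24), 2))
```

Inputs of 23.5% and 24.5% (the values that round to 24%) give RGs of roughly 4.8 and 5.3.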

While Mauer has an unusual profile, I wouldn't describe it as extremely unusual. The four largest deltas between H% and eH% in 2000-09 all belong to Ichiro Suzuki, with H%s over 75% against expectations in the high 50s. Juan Pierre and Placido Polanco are two other players whose names pop up on that list. Limiting the group to players with RG > 7, Mauer is the only batter whose name appears twice in the top ten deltas.

Looking at the P% deltas for players with RG > 7, Mauer's 2006 was the largest (19% actual, 29% expected), and his 2009 was tenth. Barry Bonds' 2002 even manages to rank fifth despite 45 homers, because 22% of his value came from walks.

Whether Mauer is able to approach the value projection implied by his contract without retaining some of his power gains is a question best left for the projection mavens. However, just looking at his career to this date, Mauer's power has always made up much less of his value than it does for a typical player at his level of offensive productivity, and his 2009 was no exception (albeit slightly less extreme). At least to this point, Mauer has been one of the most valuable hitters in the game while relying on power to the same extent as a league-average performer.

Monday, June 03, 2013

Offensive Percentages

I wrote this post (and its companion which will go up at a later date) in 2010, but didn't like it enough to publish it here. However, this topic came up on Twitter recently, and Sky Kalkman wrote up his take on it here. Since I already had this written, I thought I might as well add it to the conversation.

In one of his early national Abstracts, Bill James published a method that estimated the percentage of a player's offensive value (actually, his Runs Created) which was derived from Batting Average. Since RC is simply (H+W)*TB/(AB+W), one can calculate a player's RC in the absence of power and walks (that is, with W = 0 and TB = H) by taking H^2/AB. Dividing this by RC gave James an estimate of what percentage of his contribution came from base hits alone.
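A minimal sketch of that calculation, using the basic RC formula quoted above. The function names and the stat line are hypothetical; this is not a reproduction of James's published tables.

```python
def runs_created(ab, h, tb, w):
    """Basic Runs Created: (H + W) * TB / (AB + W)."""
    return (h + w) * tb / (ab + w)

def hits_only_rc(ab, h):
    """RC with walks and extra bases stripped out (W = 0, TB = H): H^2 / AB."""
    return h * h / ab

def ba_share(ab, h, tb, w):
    """Estimated fraction of RC derived from base hits alone."""
    return hits_only_rc(ab, h) / runs_created(ab, h, tb, w)

# Hypothetical stat line: 500 AB, 150 H, 250 TB, 50 W
print(round(ba_share(500, 150, 250, 50), 3))
```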

James' method was later expanded by Gerry Myerson in the Big Bad Baseball Annual to estimate the share of RC derived from walks and power. James Fraser (whatever happened to him, anyway?) later applied a similar approach to Extrapolated Runs.

Let's start with a simple, static linear formula, basically Paul Johnson's ERP. This is not the most precise run estimator available, but it's easy to work with and is good enough for this type of application:

RC = (.5H + TB + W - .3(AB-H))*.324 ~= .49H + .32EB + .32W - .1(AB - H) = .59H + .32EB + .32W - .1AB

It is pretty easy to split this up into the basic components of hits, walks, and power (as shown). However, there is the little problem of the negative runs that are charged for outs made. If you lump them in with hits, the share of offense contributed by base hits will be driven down. If you ignore them and compare to total RC, you'll end up saying that the percentage of value contributed by hits, walks, and power combined is greater than 100%, and by a different amount for each player. So instead, I’ll look at the contribution of hits, walks, and extra bases towards the positive linear weight value, and ignore the negative from outs. I make no claim that this is the optimal way to do this, but it seems like the least bad alternative.

Since we're not dealing with actual RC figures anymore, we can safely ignore the .324 multiplier and make it real simple:

Pos = .5H + TB + W = 1.5H + EB + W

The percentage of positive linear weights contributed by hits, walks, and power (extra bases) is straightforward:

H% = 1.5H/Pos

W% = W/Pos

P% = EB/Pos
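In code, the split might be sketched as follows. The function name and the stat line are made up for illustration, and extra bases are taken as TB - H, as in the formulas above.

```python
def hwp_shares(h, tb, w):
    """Return (H%, W%, P%) as fractions of positive linear weights."""
    eb = tb - h              # extra bases
    pos = 1.5 * h + eb + w   # Pos = .5H + TB + W
    return 1.5 * h / pos, w / pos, eb / pos

# Hypothetical stat line: 150 H, 250 TB, 50 W
h_pct, w_pct, p_pct = hwp_shares(150, 250, 50)
print(round(h_pct, 3), round(w_pct, 3), round(p_pct, 3))
```

By construction the three shares sum to 1, which is why outs have to be left out of the denominator.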

I'm not quite sure how to express this coherently, but these percentages don't really represent the portion of a player's overall offensive value arising from those three components. They represent the share of a player's absolute positive Runs Created that arises from those three components. If you tried to apply this approach to absolute RC, it would fall apart, because you have to do something about the outs. If you tried to apply this approach to a baselined metric (RAA, RAR), it would really fall apart. You would have players with a negative denominator, and thus negative percentages, players with negative hit contributions but a negative denominator resulting in positive percentages, and all manner of results which wouldn't make much sense.

The bottom line is that, as Bill James explained when he introduced his version, you can't use the percentages literally. That doesn't make these percentages useless, but it does make them more of a freak show stat than they otherwise might be. Still, if you don't treat the percentages as literal, but as abstractions, and only compare them relatively between players, they have the potential to yield some insight.

Let's begin with the major league percentages for 2009 [I'm going to display these as (H, W, P) from this point]:

AL: (61, 15, 24)
NL: (61, 16, 23)

Simply collecting base hits is responsible for about 60% of the positive run value in the majors. It's not that batting average is worthless--if you break OBA and SLG down into the portions derived from base hits and walks (OBA) or power (SLG), the hits portion is more important. The problem with BA is that it doesn't add much additional information given that you already have the more complete metrics. Getting hits is still a very important part of offense, and no sabermetrician will ever tell you otherwise.

Of course, the way I've split things up is to put the first base of every hit together. You could split off singles on their own, and leave the first bases of extra base hits in the "power" grouping, and of course the share of positive value credited to "power" would go up. Personally, I think this kind of approach is more useful if the extra bases are spun off.

In any event, players will have much more extreme profiles than the league as a whole. Consider these four players from the 2009 AL:

Suzuki: (76, 7, 16)
Punto: (60, 30, 10)
Pena: (41, 22, 37)
Delmon Young: (71, 5, 24)

Ichiro led in H%; Punto led in W% and trailed in P%; Pena led in P% and trailed in H%; and Delmon Young was last in W%.

The disclaimer about abstraction can be illustrated by example. Compare Suzuki and Punto. 7% of Suzuki's positive linear weight total came from walks, while 30% of Punto's did. Suzuki's walk rate was .059, Punto's was .145. If we could use the percentages literally, then Suzuki's overall rate of offensive productivity would be proportional to .059/.07 = .843 and Punto's to .145/.3 = .483. It doesn't matter whether you use RC/PA, RC/O, or any other sensible overall rate--you're not going to be able to reconcile the players' ratio in those metrics and the players' ratio in nonsense units. You might be able to tie them loosely to an overall metric--after all, they can be tied back to "Pos" by definition. However, the positive linear weight values on their own, without subtracting or dividing by outs in any way, don't capture the full extent of a player's offensive productivity.
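The arithmetic behind that comparison is short enough to sketch. The function name is made up; the walk rates and walk shares are the ones quoted above.

```python
def implied_total(walk_rate, walk_share):
    """Overall rate implied by taking the walk share literally: rate / share."""
    return walk_rate / walk_share

suzuki = implied_total(0.059, 0.07)
punto = implied_total(0.145, 0.30)

# Ratio of the two "implied" totals, in nonsense units
print(round(suzuki, 3), round(punto, 3), round(suzuki / punto, 2))
```

The final number is the Suzuki-to-Punto ratio that a literal reading of the percentages would require any sensible overall rate to reproduce.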

Next time, I'll look at how H, W, and P% look for hitters when they are grouped by overall productivity. To be one of the very best hitters, a player is going to have to contribute in all three areas--a player like Ichiro gives us some hint as to the upper limit for a player with very little secondary contribution. Looking at hitters' breakdowns by quality groups will not provide much analytical value, but it does help in identifying players with unique styles.