Monday, April 25, 2016

LWR and Component Deflators

If I tell you that a hitter has a line of .260/.330/.400, and I tell you that he plays in a park that inflates run scoring by 5%, what would you estimate his batting line would be in a neutral park? Suppose we know nothing at all about the park other than its effect on runs scored; we don’t know how it changes the rates of home runs, or strikeouts, or hits, or any other statistical category. We don’t know the dimensions or the altitude or the fence height or the type of playing surface, so we can’t make any estimates based on those factors either. How are you going to answer this?

One path you could take is to assume that the park influences the rates of all events equally. You could assume that the park will increase the number of singles by X%, the number of walks by X%, the number of home runs by X%, etc. Will this necessarily be the best approach? No; after all, we expect the maximum real world park factor for walks to be much less extreme than for home runs, for instance. It may be an acceptable approximation, but it would be a stretch to say it was the optimum approach.

However, what if we are only concerned about value, and don’t wish to adjust individual components? Personally, this is the question that interests me the most. I don’t care if the park benefits a certain type of hitter more than another; I just want to know what the effect was on runs scored. If a batter’s style or approach enables him to take more advantage of a park than the average batter, I don’t wish to take that credit away from him, as it generates real value for his team.

So if I want to adjust a player’s slash line for park, I don’t care about the precise park effect on doubles, walks, or strikeouts. I want to answer “What batting line would provide equivalent value in a neutral park, assuming that the proportional relationships between the components of this man’s line were a constant?” In other words, if this batter had a 2:1 ratio of singles to extra base hits in reality, I want him to still have a 2:1 ratio when we’ve adjusted his line for park.

This is where the approach mentioned above comes into play; assume the park has an equal effect on each mutually exclusive component of the batting line, and go from there. In order to start this process, we will make the assumption that the linear weight values will remain constant as we move between environments. Obviously this is a faulty assumption; linear weight values are dependent on context. However, linear weight values are also fairly stable over similar run environments, particularly in the case of the out value when we are using the -.1 type. A park increasing run scoring by 5% shouldn’t have too dramatic of an effect on the coefficients so as to render our conclusions invalid. Nonetheless, for more extreme parks, the potential for problems will be larger.

Let us define a new variable, a. a is what I will call the “component deflator” (I am borrowing the term “deflator” from Stephen Tomlinson’s “run deflator” as defined in the Big Bad Baseball Annual). Assuming stable linear weight values, using the definition of terms from the last post, and limiting the scope of our categories to the basic mutually exclusive offensive categories (singles, doubles, triples, homers, walks, batting outs) we can start by saying that:

RC/PF = new RC

New RC/Out = (sS*a + dD*a + tT*a + hHR*a + wW*a + x*(1 - S*a - D*a - T*a - HR*a - W*a))/(1 - S*a - D*a - T*a - HR*a - W*a)

All we have done is assume that the frequency of each event will be equally affected, by a factor of “a”. We can also simplify the out term to be 1 - a*(S + D + T + HR + W).

Just as seen in the last installment, we can cancel out the out terms from the numerator and the denominator, and thus save ourselves a lot of hassle, and write everything in terms of Linear Weight Ratio:

New LWR = (S*a + d'D*a + t'T*a + h'HR*a + w'W*a)/(1 - a*(S + D + T + HR + W))

Where new LWR = (New RC/O - x)*s'.

I have kept symbols everywhere to keep this as general as possible, but let’s remember what they actually are. d', t', etc. are all known values, all fixed coefficients. S, D, T, etc. are simply the frequencies of a set of mutually exclusive events. The only unknown variable in this equation is a, the common deflator.

Occasionally I find it necessary to include a disclaimer that I am not a mathematician, and this is one of those times. The way I am going to describe solving for a severely overcomplicates the matter and makes the connection between LWR and a seem tenuous at best. And it’s true, you don’t need to convert to LWR in order to do this type of approximation; I just like doing it that way because of the aforementioned canceling out of the out term in the RC/O numerator and denominator.

To solve for a, let’s define the LWR numerator as N:

N = S + d'D + t'T + h'HR + w'W

One way to look at this is that, in effect, we have stated the player’s positive linear weight contributions from all events as an equivalent number of singles, since singles have a weight of one.

Let’s also write the denominator in an equivalent number of singles. Find (S + D + T + HR + W)/S and call this D (sorry for doubling up with doubles here). This is the ratio of all non-out PA outcomes to singles.

Then, the New LWR can be viewed as this:

New LWR = N*a/(1 - D*S*a)

We have reduced all of the events down to an equivalent number of singles, and can solve for the ratio of singles under our new conditions to singles under old conditions that result in the desired new LWR. This is a, and it is the same ratio that will apply to the other events (except outs, which have to be handled differently):

a = New LWR/(N + D*S*New LWR)
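For anyone who wants the intermediate algebra spelled out, here is one way to get from the New LWR expression to the formula for a. Nothing here is new; it just rearranges the equation above, with D meaning the ratio of non-out outcomes to singles (not doubles):

```latex
\begin{aligned}
\text{New LWR} &= \frac{N a}{1 - D S a} \\
\text{New LWR}\,(1 - D S a) &= N a \\
\text{New LWR} &= a\,\bigl(N + D S \cdot \text{New LWR}\bigr) \\
a &= \frac{\text{New LWR}}{N + D S \cdot \text{New LWR}}
\end{aligned}
```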

Now, our player’s new rate of singles will be S*a. His new rate of doubles will be D*a, his new rate of home runs will be HR*a, and so on for all events except outs. His new rate of outs will be 1 - S*a - D*a - T*a - HR*a - W*a, or substitute “PA” for “1” if you are using the actual count of each event rather than the per-PA frequencies. The outs can also be adjusted as 1 - a*(1 - Outs), or PA - a*(PA - Outs), depending on whether you are using frequencies or counts.
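The procedure can be sketched in a few lines of Python. This is only an illustration, assuming per-PA frequencies as inputs and the ERP-based LWR coefficients used in this series; the function names (`component_deflator`, `apply_deflator`, `lwr`) are made up for the example, and the sample frequencies are the ones from the worked example later in this post.

```python
# Illustrative helpers (hypothetical names). Weights are the ERP-based
# LWR coefficients quoted in this series; swap in your own set.
LWR_WEIGHTS = {"S": 1.0, "D": 1.67, "T": 2.33, "HR": 3.0, "W": 0.67}


def component_deflator(freqs, target_lwr, weights=LWR_WEIGHTS):
    """Solve a = New LWR / (N + D*S*New LWR) from per-PA frequencies."""
    n = sum(weights[k] * freqs[k] for k in weights)  # N, the LWR numerator
    non_outs = sum(freqs[k] for k in weights)        # same quantity as D*S in the text
    return target_lwr / (n + non_outs * target_lwr)


def apply_deflator(freqs, a):
    """Scale every non-out event by a; outs take up the rest of the PA."""
    new = {k: freqs[k] * a for k in LWR_WEIGHTS}
    new["Out"] = 1.0 - sum(new.values())
    return new


def lwr(freqs, weights=LWR_WEIGHTS):
    """LWR = weighted non-out events divided by outs (per PA)."""
    outs = 1.0 - sum(freqs[k] for k in weights)
    return sum(weights[k] * freqs[k] for k in weights) / outs


# Frequencies from the worked example later in this post.
player = {"S": 0.167, "D": 0.041, "T": 0.006, "HR": 0.021, "W": 0.086}
a = component_deflator(player, target_lwr=0.614)
adjusted = apply_deflator(player, a)
print(round(a, 3), round(lwr(adjusted), 3))
```

Recomputing LWR from the adjusted line is a useful sanity check: if it does not land on the target New LWR, something went wrong in the scaling.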

Those of you who are astute and who are not totally bewildered by the circuitous way I defined terms and got to this point (which should eliminate most of you, since if you are in fact astute you are rightfully thinking “What the heck is wrong with this guy?”) may notice that the execution here is similar to Bill James’ “Willie Davis method”. And that it is. James converts a player’s batting line into an equivalent number of singles, finds the proportion of translated singles to original singles necessary to yield the right new number of Runs Created (which involves the quadratic formula due to the nature of the RC formula), and adjusts the other events accordingly. So the procedure I’m using here is not in any way new or unique; it is just an application of it in the case of Linear Weights.

Let me walk you through an example, since I’ve made this confusing as all get out. Let’s review the ERP-based LWR that I derived last time for example purposes:

LWR = (S + 1.67D + 2.33T + 3HR + .67W)/(1 - S - D - T - HR - W)

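Let’s suppose that we want to take a league-average player from the 1990 NL and project his statistics in an extreme park, a mid-90s Coors type park with a 1.20 PF. Here are his statistics, expressed as per-PA frequencies:

S = .167, D = .041, T = .006, HR = .021, W = .086, Outs = .679, LWR = .545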

With some basic algebraic manipulations on the equations in the last post, we can go directly from LWR to New LWR by this formula:

New LWR = ((LWR/s' + x)*adjustment - x)*s'

Where we recall that x is the linear weight value of an out (-.097 in this case), s' is the reciprocal of the linear weight value of a single (2.058 in this case), and adjustment is the scalar effect on runs/out (1.20 in this case). So:

New LWR = ((.545/2.058 - .097)*1.2 + .097)*2.058 = .614
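As a quick check on the arithmetic, the LWR-to-New-LWR conversion can be written as a small Python function. This is a sketch only; the name `park_adjusted_lwr` is hypothetical, and the defaults are the ERP-based x and s' values quoted above.

```python
def park_adjusted_lwr(lwr, adjustment, x=-0.097, s_prime=2.058):
    """New LWR = ((LWR/s' + x)*adjustment - x)*s'."""
    runs_per_out = lwr / s_prime + x               # LWR -> R/O
    new_runs_per_out = runs_per_out * adjustment   # apply the park effect to R/O
    return (new_runs_per_out - x) * s_prime        # R/O -> LWR


print(round(park_adjusted_lwr(0.545, 1.20), 3))
```

Note that an adjustment of 1.0 hands back the original LWR, which is what we would want from a neutral park.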

From here, we need to find “N” and “D”:

N = S + 1.67D + 2.33T + 3HR + .67W = .167 + 1.67(.041) + 2.33(.006) + 3(.021) + .67(.086) = .370

D = (S + D + T + HR + W)/S = (.167 + .041 + .006 + .021 + .086)/.167 = 1.922

And now we can solve for a:

a = New LWR/(N + D*S*New LWR) = .614/(.370 + 1.922*.167*.614) = 1.083

In order to increase this player’s RC/O by 20%, we need to increase his singles, doubles, triples, homers, and walks by 8.3% each. This yields a new batting line of:



So this player has gone from hitting .257/.321/.384 to hitting .281/.348/.419. His park-adjusted value has been held constant, as have his relative frequencies of each positive PA outcome. The key is that he has more of all the positive events, and thus fewer outs.

In case you are curious, from the limited set of frequencies defined here, BA is (S + D + T + HR)/(1 - W); OBA is S + D + T + HR + W; and SLG is (S + 2D + 3T + 4HR)/(1 - W).
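Those three definitions are enough to reproduce the before and after slash lines. Here is a minimal Python sketch, assuming the rounded per-PA frequencies and the rounded deflator (1.083) from the example above; `slash_line` is a made-up helper name.

```python
def slash_line(f):
    """Return (BA, OBA, SLG) from per-PA frequencies of S, D, T, HR, W."""
    hits = f["S"] + f["D"] + f["T"] + f["HR"]
    at_bats = 1.0 - f["W"]  # per PA, everything except walks is an AB here
    total_bases = f["S"] + 2 * f["D"] + 3 * f["T"] + 4 * f["HR"]
    return hits / at_bats, hits + f["W"], total_bases / at_bats


before = {"S": 0.167, "D": 0.041, "T": 0.006, "HR": 0.021, "W": 0.086}
a = 1.083  # component deflator from the example
after = {event: freq * a for event, freq in before.items()}

print("%.3f/%.3f/%.3f" % slash_line(before))
print("%.3f/%.3f/%.3f" % slash_line(after))
```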

Monday, April 11, 2016

Linear Weight Ratio

Note: The series of four posts I will be posting over the next month were written a long time ago, apparently in 2009. Since I have not been prolific in producing new material lately, I figured I might as well post some older stuff I’ve written that at the time I didn’t deem good or interesting enough to post. I did not vet all of the material in them, so any inaccuracies are my fault but do not necessarily reflect my current thinking.

Linear Weights Ratio (LWR) is an offensive metric developed by Tango Tiger, based on Linear Weights. Since it was developed and explained by Tango, there is really no need for me to step in and write a post that may just serve to confuse you. And I have not defined everything in exactly the same way he did, which will only add to the confusion.

I have always liked to write descriptions of other people’s research, for a couple of reasons. One is as a sort of critique/peer review, which does not have to be critical--it can also point out the positives about an approach. A second is so that if I use something later (and I have an upcoming post that uses LWR in the vein of Bill James' "Willie Davis method"), my readers can have some degree of confidence that I understand the topic at hand. All too often you will see people use metrics that they don’t really understand. By writing about the ones I’m using, I will be presenting you with sufficient evidence to draw your own conclusions as to whether or not I understand the tools I am using.

Let’s begin by focusing only on the basic, mutually exclusive offensive events: singles, doubles, triples, home runs, walks, and batting outs (AB - H). For now, we will assume that those categories encompass every possible outcome of a plate appearance. Let us also assume that we have some set of linear weights which give the value of each of those events: s is the value of a single, d of a double, t of a triple, h of a home run, w of a walk, and x of an out. Additionally, I am approaching this problem with absolute (total runs scored) weights, so x is something like -.1, not -.3. Tango’s LWR used the -.3 type value.

Given those assumptions, we can of course write:

RC = sS + dD + tT + hHR + wW + xO

Let’s consider “S”, “D”, etc. to be per PA frequencies (again, these events are assumed to encompass all possible PA outcomes, so PA = S + D + T + HR + W + O). If that is the case, we can rewrite O as 1 - S - D - T - HR - W, and write an expression for RC/Out:

RC/O = (sS + dD + tT + hHR + wW + x(1 - S - D - T - HR - W))/(1 - S - D - T - HR - W)

The out term can be canceled out, leaving us with:

RC/O = (sS + dD + tT + hHR + wW)/(1 - S - D - T - HR - W) + x

You can see that there is no need for the out term to be included at all; we are still implicitly including outs, but we don’t need to include them in the equation. The numerator of the expression is the run contribution of each event, excluding outs, while the denominator is outs. This is what I will call rLWR, for run LWR:

rLWR = (sS + dD + tT + hHR + wW)/(1 - S - D - T - HR - W)

In figuring his Linear Weight Ratio, Tango adds an additional wrinkle, and sets the weight of a single equal to 1, with the other weights changing proportionally. We can define s' as 1/s, and use that to define d' = d*s', t' = t*s', etc., and write LWR as:

LWR = (S + d'*D + t'*T + h'*HR + w'*W)/(1 - S - D - T - HR - W)

At this point I’ll plug in some actual numbers from the basic ERP equation I use ((TB + .8H + W - .3AB)*.324). This is not an optimal equation, and that’s okay because my point here is not to present a formula that you should use, just to demonstrate how you can derive your own formula for LWR based on whatever set of linear weights you are using. When that ERP equation is expanded, it becomes:

ERP = .486S + .810D + 1.134T + 1.458HR + .324W - .097(AB - H)

Which yields s' = 2.058 and the following LWR equation:

LWR = (S + 1.67D + 2.33T + 3HR + .67W)/(1 - S - D - T - HR - W)
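If you want to derive the LWR coefficients for your own set of weights, the steps can be sketched in Python as below. This assumes the basic ERP equation quoted above; the helper `erp_weight` and the variable names are purely illustrative.

```python
ERP_MULTIPLIER = 0.324


def erp_weight(tb=0, h=0, w=0, ab=0):
    """Run value of one event under ERP = (TB + .8H + W - .3AB)*.324."""
    return (tb + 0.8 * h + w - 0.3 * ab) * ERP_MULTIPLIER


# What one occurrence of each event adds to TB, H, W, and AB.
run_weights = {
    "S": erp_weight(tb=1, h=1, ab=1),
    "D": erp_weight(tb=2, h=1, ab=1),
    "T": erp_weight(tb=3, h=1, ab=1),
    "HR": erp_weight(tb=4, h=1, ab=1),
    "W": erp_weight(w=1),
}
out_weight = erp_weight(ab=1)  # x, the value of a batting out

# Rescale so that a single is worth exactly 1.
s_prime = 1 / run_weights["S"]
lwr_weights = {event: value * s_prime for event, value in run_weights.items()}

for event, weight in lwr_weights.items():
    print(event, round(weight, 2))
```

Any other linear weights formula would work the same way: expand it to per-event values, then divide everything by the value of a single.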

If you are using the actual counts of each event rather than the per PA frequencies, this could be written the same except PA would replace 1 in the denominator.

It is easy to convert between LWR and R/O, and it is a linear process. The equations are:

R/O = rLWR + x
rLWR = R/O - x
R/O = LWR/s' + x
LWR = (R/O - x)*s'
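To see that these conversions really do agree with computing RC/O the long way, here is a small Python check. The frequencies are made-up numbers for illustration only, and the weights are the expanded ERP values from above.

```python
# Expanded ERP run weights and the out value x, as quoted in this post.
RUN_WEIGHTS = {"S": 0.486, "D": 0.810, "T": 1.134, "HR": 1.458, "W": 0.324}
X = -0.097
S_PRIME = 1 / RUN_WEIGHTS["S"]

# Made-up per-PA frequencies, purely for illustration.
freqs = {"S": 0.16, "D": 0.05, "T": 0.005, "HR": 0.03, "W": 0.09}

outs = 1.0 - sum(freqs.values())
positive_runs = sum(RUN_WEIGHTS[k] * freqs[k] for k in freqs)

# RC/O with the out term left in the numerator.
direct = (positive_runs + X * outs) / outs

# The same thing via rLWR and LWR.
r_lwr = positive_runs / outs
lwr = r_lwr * S_PRIME

print(round(direct, 4), round(r_lwr + X, 4), round(lwr / S_PRIME + X, 4))
```

All three printed values should match, since canceling the out term and rescaling by s' are both just rearrangements of the same expression.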

What alterations do we have to make to include non-batting outs in our ratio? This can be tricky since we can no longer assume a uniform value for outs across types. But we just need to ensure that the above relationships still hold, and weight the event in the numerator accordingly. Since R/O = LWR/s' + x, (LWR/s' + x)*Outs must equal RC. We can expand that out:

(LWR numerator/(Outs*s') + x)*Outs = RC

which simplifies to:

LWR numerator/s' + x*Outs = RC

For any specific event, x is known (the -.097 value), Outs is known (each out is worth one out), s' is known (2.06 in this case), and the RC weight of the event in question is known (let’s say we have CS at an overall value of -.3), so all we need to do is solve for the needed coefficient in the LWR numerator:

(RC weight - x)*s' = LWR numerator

For the CS example:

(-.3 - (-.097))*2.06 ~= -.42

Friday, April 01, 2016

2016 Predictions

Standard disclaimer applies. Also, I’m giving myself an extra Oreo for every time I can use the phrase "on paper".

AL EAST

1. Boston
2. Toronto (wildcard)
3. New York
4. Tampa Bay
5. Baltimore

I’ve picked the Red Sox to win the AL East in 2015, 2012, 2011, 2009, and 2007 and to win the pennant in 2015, 2012, 2011, 2009, and 2007. I was right in 2007; in 2013, when they won another division and pennant, I picked them to finish third. I guess what I’m trying to say is "David Price, if you’re reading this, don’t put in a pre-order on a duck boat."

To attempt to analyze why I have been so wrong about the Red Sox so frequently would be taking this exercise more seriously than I intend, and would be about me and not baseball, which is of no interest to anyone other than me. So I will leave any questions about whether I qualify for the popular definition of insanity to the reader and instead point out that the crude infrastructure I use to inform these predictions really left me with no choice; I have Boston six wins ahead of anyone else in the AL. Only Toronto projects to score more runs and only Cleveland and New York project to allow fewer. One would think they would have fewer disaster positions and a stronger rotation than in 2015.

At first I was surprised to see Toronto still ranked highly, which is a testament to how unless one reasons this out on paper, it would be easy to overreact to losing a rental pitcher who was only there for two months and forget that one picked them second last year as well and there’s little reason to be more bearish on the team now. Of course, reasoning this out on paper is what leads me to pick the Red Sox all the time.

New York should be right in the mix for the wildcard; if Tanaka and Pineda can somehow stay healthy, they have a sneaky good rotation. I’m not feeling the Tampa Bay love, as their rotation has multiple question marks and their offense is lacking (I don’t think one can count on even a healthy Evan Longoria being a star-level performer). Baltimore should serve as a warning as to how quickly special pleading about outperforming Pythagorean and winning one-run games and the like can be forgotten when the team has a bad year. They’re not the cool kids any longer, those guys are in the next division…

AL CENTRAL

1. Detroit
2. Kansas City
3. Cleveland
4. Chicago
5. Minnesota

Everyone, including me, will tell you about how little there is separating most of the AL teams on paper. Since I suspected this would be the case before I even sat down to put anything on paper, I decided that I would pick the AL in exactly the order my numerical exercise suggested with one exception--should Cleveland be in playoff position, I would drop them out of it. With the exception (discussed below) of the second wildcard, that is exactly what I have done.

I wouldn’t be surprised if I pick the 2017 Tigers to finish last, but I think they did enough patching this season that a dead cat bounce (see what I did there?) may be in the offing. You can just picture them getting off to a slow start and roaring (I’ll stop now) back behind interim manager Dave Clark or whoever.

Then there are the Royals. On paper I have them with 80 wins, just ahead of the White Sox, but I’ll dutifully jump them over Cleveland just the same. They have outplayed their PW% (W% estimated from runs created and allowed) by 19 games over the past two seasons, which is the seventh-highest total since 2003 (I have figured PW% for my end of season stats back to 2003, not always by the exact same method but this is a case in which the concept is much more important than the specific implementation of it). Their predecessors have generally done well in outplaying PW% again in year 3:



The average year 3 out-performance is 3.7 wins, so let’s be generous and give the Royals four more wins (the sabermetrically-sharp among you probably noticed that this very crude and unendorsed methodology is assuming that these teams’ RC and RC Allowed are consistent with pre-season expectations). That puts them at 84; I have Detroit with 84 and Cleveland with 83.

Does that placate you if you think last year was “the year that Base Runs failed”? Setting aside the ridiculous nature of hanging errors which are created jointly by Base Runs and Pythagenpat solely around the neck of the former, of course one must objectively acknowledge that PW%, whatever reasonable inputs one might use, had a bad year in 2015. A really bad year. The chart gives the RMSE of (W% - PW%) from 2003-2015:



An RMSE equivalent to 6.66 per 162 games was by far the worst over this period. But note that the previous two seasons were the best over this time period, and that the overall trend, if one can divine one, appears to be stable or improving accuracy over time. So which do you believe--that there’s a possibility worth multiple blog posts about that suddenly, in 2015, all of the underpinnings of run and win estimation and the combination thereof ceased to work? Or that sometimes the dice roll a little bit differently?

The general discussion of PW% is not specific to Kansas City, of course; the Royals could this season once again outplay their PW% even if the league-wide error returns to normal levels. But if you feel compelled to hedge late in your post by writing the phrase “this is probably just a blip”, it almost certainly is a blip.

The Indians are my team, which is why I won’t pick them to make the playoffs unless I’m really feeling it in addition to seeing it in the objective projections (almost the opposite of how I approached picking the Indians as a younger human being, in which case feeling was the only thing that mattered). The fact that they couldn’t even do something like bring in Austin Jackson for $5 million to help shore up a dreadful looking outfield prevents me from believing that this is their year. Seriously, the opening day outfield is Rajai Davis, Tyler Naquin, and Marlon Byrd backed up by Collin Cowgill. Send money. The White Sox could be in the mix, but I still see a below-average offense with good but not great starters and a mediocre bullpen, even if I really like Carlos Rodon. The Twins certainly have some offensive players to watch, but their multi-season run of bad starting pitching doesn’t seem to be coming to an end this year.

AL WEST

1. Houston
2. Seattle (wildcard)
3. Texas
4. Los Angeles
5. Oakland

On paper, the Astros and A’s stand out from the pack in this division; the other three look to be pretty close to me. If I was reading this post, I would stop here, because I have vowed to stop reading any baseball article that uses the term “tank” (all apologies to Dayan Viciedo). But it seems to me that much of the alarmism about the imagined problem of “tanking” stems from the interests and fans of rich teams. Fans of these teams, which could never allow themselves to take a clear step backward to the extent that the Astros or Cubs did, don’t seem to appreciate it when opponents try different approaches to build lasting contenders rather than simply throwing money around trying to reach 85 wins and perpetually hunt for a wildcard berth. I can’t blame them--it would be nice to be a fan of a league in which any clubs that can’t match your financial advantage are forever stuck in the middle. But the easiest thing in the world is to be a fan of a rich team and chastise other teams for winning 60 games every once in a while.

I’m not really sold on the Mariners as a wildcard team, but I have a general rule against picking two wildcards from the same division (even though this is quite possible as the NL Central demonstrated last year) and so I’m not picking the Yankees. On paper, I have the Mariners and the Royals virtually tied, with a slight edge to the former. I can talk myself into believing it--most of the reasons why I and many others liked them last year are still in place, with Jerry Dipoto seemingly doing a nice job of tinkering on the edges of the roster. The same is true of the Rangers, but in the opposite direction. Yes, they now have Hamels, we know Prince Fielder is still alive, and Darvish should be back at some point, but there’s still a reason they were picked last by many in 2015. The Angels are unintentionally going for a stars and scrubs approach, but they only have one star. That he’s the brightest in the firmament is still not enough to make that a winning strategy. The A’s should have been much better last year, but this year may be closer to their actual 2015 record than their predicted one.

NL EAST

1. Washington
2. New York (wildcard)
3. Miami
4. Atlanta
5. Philadelphia

I dislike Dusty Baker’s managing as much as the next guy, probably a bit more since I saw a fair deal of him when he was in Cincinnati. But damned if Matt Williams isn’t one of a very small number of major league managers that I think I’d be willing to replace with Dusty. A player’s manager following the unpopular Williams couldn’t hurt either.

But I don’t think you need to resort to pop psychology in order to think the Nationals are the team to beat in the East. While their roster is not as good on paper as it was last year, they are likely to stay healthier. Even with regression from Harper, significant contributions from Rendon, Ramos, even Daniel Murphy could make this a more productive offense. Their rotation is not Mets-level but it should still be good enough, although the bullpen doesn’t look great. Second half surge and Cespedes re-signing aside, I see the Mets as an average offense. The most comparable current team is closer to the Indians than to the Cubs. If things go right for the Marlins, this could be their wildcard and World Series year, but thankfully that is often true yet it still has only happened twice. I was surprised that I still had the Braves five games ahead of the Phillies on paper for 2016, although I definitely would take their next five years over the Phillies as well.

NL CENTRAL

1. Chicago
2. Pittsburgh
3. St. Louis
4. Cincinnati
5. Milwaukee

My crude estimates have the Cubs at 96 wins, which is one of the highest figures I can remember. They easily have the best offense in the league on paper, while allowing the same number of runs as the Mets. They are really good, which means they might have a 15% chance of being recognized as such when it’s all over and an 85% chance of being cited as another sad chapter, 1908, 1945 blahblahblah. I have the Cardinals and the Pirates as dead even on paper, both a step behind the second-place teams on the coasts in the wildcard hunt, St. Louis with better defense and Pittsburgh with better offense. If you believe in Searage magic, that may be reason to go with the latter; I learned today that Cory Luebke made the Pirates pen and I flipped them from how I’d originally written this. Scientific process right here. As with the NL East, I was surprised that I have the Reds five games ahead of the Brewers. As with the NL East, I don’t think it matters much one way or the other.

NL WEST

1. San Francisco
2. Los Angeles (wildcard)
3. Arizona
4. Colorado
5. San Diego

On paper I have the Dodgers six games ahead of the Giants, so it was probably a foolhardy move to flip them here. But the Dodgers have some serious questions about the health of their (otherwise very solid) rotation and enough nagging injuries to position players that I’m leaning Giants. That might be just as well for the Dodgers if they could supplement from their wealth throughout the season and flip the 2014 script on their rivals. Another reason I’m ignoring that six game gap is that my numbers have San Francisco with an average offense and I expect the team that led the majors in park-adjusted OBA last season will retain a little more production than that (even if I too am skeptical of Matt Duffy, Brandon Crawford, Joe Panik: Super Infield!). A lot of the mainstream prognostications I’ve encountered have stated as a given that the Diamondbacks have an excellent offense and just needed to shore up their pitching. But while they were third in the NL in park-adjusted RC/G in 2015, the two teams they trailed by .34 and .14 runs respectively are the two I’m picking ahead of them in the NL West. I guess maybe people really believe in Patrick Corbin’s elbow and Robbie Ray? This makes it three-for-three NL divisions where I’m surprised to have the fourth-place team so far ahead of the last place team on paper (and, except in the case of the East, surprised to have them ahead at all). But I’ll stick with the Rockies over the Padres despite my misgivings.

WORLD SERIES

Chicago (N) over Boston

Just twelve years ago, such a World Series matchup would have conjured up mixed emotions and platitudes of “at least one of them will finally get to win”. This year, everyone outside of New England and St. Louis would be united in singing “Go Cubs Go”. Remember how interesting this could have been when it’s actually Royals/Marlins or something disgusting.

AL Rookie of the Year: 1B AJ Reed, HOU
AL Cy Young: Carlos Carrasco, CLE
AL MVP: OF Mookie Betts, BOS
NL Rookie of the Year: SS Trevor Story, COL
NL Cy Young: Stephen Strasburg, WAS
NL MVP: C Buster Posey, SF