Walk Like a Sabermetrician: October 2021

Wednesday, October 20, 2021

Rate Stat Series, pt. 14: Relativity for the Theoretical Team Framework

Before jumping into win-equivalent rate stats for the theoretical team framework, I think it would be helpful to re-do our theoretical team calculations on a purely rate basis. This is, after all, a rate stat series. In discussing the TT framework in pts. 9-11, I started by using the player’s PA to define the PA of the team, as Bill James chose to do with his TT Runs Created. This allowed our initial estimate of runs created or RAA to remain grounded in the player’s actual season.

An alternative (and as we will see, equivalent) approach would be to eschew all of the “8*PA” and just express everything in rates to begin with. When originally discussing TT, I didn’t show it that way, but maybe I should have. I found that my own thinking when trying to figure out the win equivalent TT rates was greatly aided by walking through this process first.

Again, everything is equivalent to what we did before – if you just divide a lot of those equations by PA, you will get to the same place a lot quicker than I’m going to. The theoretical team framework we’re working with assumes that the batter gets 1/9 of the PA for the theoretical team. It’s also mathematically true that for Base Runs:

BsR/PA = (A*B/(B+C) + D)/PA = (A/PA)*(B/PA)/(B/PA + C/PA) + D/PA

If for the sake of writing formulas we rename A/PA as ROBA (Runners On Base Average), B/PA as AF (Advancement Factor; I’ve been using this abbreviation long before it came into mainstream usage in other contexts), C/PA as OA (Out Average), and D/PA as HRPA (Home Runs/PA), we can then write:

BsR/PA = ROBA*AF/(AF + OA) + HRPA

Since it is also true that R/O = R/PA/(1 – OBA), in this case it is true that:

BsR/O = (BsR/PA)/OA

We can use these equations to calculate the Base Runs per out for a theoretical team (I’m going to skip over “reference team” notation and just assume that the reference team is a league average team):

TT_ROBA = 1/9*ROBA + 8/9*LgROBA

TT_AF = 1/9*AF + 8/9*LgAF

TT_OA = 1/9*OA + 8/9*LgOA

TT_HRPA = 1/9*HRPA + 8/9*LgHRPA

TT_BsR/PA = TT_ROBA*TT_AF/(TT_AF + TT_OA) + TT_HRPA

TT_BsR/O = (TT_ROBA*TT_AF/(TT_AF + TT_OA) + TT_HRPA)/TT_OA

Here’s a sample calculation for 1994 Frank Thomas:

To calculate a win-equivalent rate stat, we can use the TT_BsR/O figure as a starting point (it suggests that a theoretical team of 1/9 Thomas and 8/9 league average would score .2344 runs/out). We don’t need to go through this additional calculation, though; when we calculated R+/O+ (or R+/PA+, RAA+/O+, or RAA+/PA+), we already had everything we needed for this calculation.

You will see if you do the math that:

TT_BsR/O = (RAA+/O+)*(1/9) + LgR/O

TT_BsR/O = (R+/O+)*(1/9) + (LgR/O)*(8/9)

TT_BsR/O = (RAA+/PA+)/(1 – LgOBA)*(1/9) + LgR/O

TT_BsR/O = (R+/PA+)/(1 - LgOBA)*(1/9) + (LgR/O)*(8/9)

You could view this as a validation of the R+/O+ approach, as it does what it set out to do, which is to isolate the batter’s contribution to the theoretical team’s runs/out. Once we’ve established the team’s runs/out, it is pretty simple to convert to wins. I will just give formulas as I think they are pretty self-explanatory:

TT_BsR/G = TT_BsR/O*LgO/G

TT_RPG = TT_BsR/G + LgR/G

TT_x = TT_RPG^.29

TT_W% = (TT_BsR/G)^TT_x/((TT_BsR/G)^TT_x + (LgR/G)^TT_x)

Walking through this for the Franks, we have:

One thing to note here is that if we look at the theoretical team’s R/O (or R/G) relative to the league average, subtract one, multiply by nine, and add one back in, we will Thomas and Robinson’s relative R+/O+. This is not a surprising result given what we saw above regarding the relationship between R+/O+ and theoretical team R/O.

We now have a W% for the theoretical team, which we could leave alone as a rate stat, but it’s not very satisfying to me to have an individual rate stat expressed as a team W%. If we subtract .5, we have WAA/Team G; we could interpret this as meaning that Thomas is estimated to add .0609 wins per game and Robinson .0546 to a theoretical team on which they get 1/9 of PA. Another option would be to convert this WAA back to a total, defining “games” as PA/Lg(PA/G), and then we could have WAA+/PA+ or WAA+/O+ as rates.

In keeping with the general format established in this series, though, my final answer for a win-equivalent rate stat for the TT framework will be to convert the winning percentage (actually, we’ll use win ratio since it makes the math easier) back to the reference environment, and calculate a relative adjusted R+/O+. Since everything will be on an outs basis (as we’re using O+), we don’t need to worry about league PA/G when calculating our relative adjusted R+/O+.

Instead of calculating TT_W%, we could have left it in the form of team win ratio:

TT_WR = ((TT_BsR/G)/(LgR/G))^(TT_x)

We can convert this back to an equivalent run ratio in the reference environment (which for this series we’ve defined as having Pythagorean exponent r = 1.881) by solving for AdjTT_RR in the equation:

TT_WR = AdjTT_RR^r

AdjTT_RR = TT_WR^(1/r)

We could convert this run ratio back to a team runs/game in the reference environment, and then to a team runs/out, and then use our equation for tying individual R+/O+ to theoretical team R/O to get an equivalent R+/O+ ratio. But why bother with all that, when we will just end up dividing it by the reference environment R/O to get our relative adjusted R+/O+? I noted above that there was a direct relationship between the theoretical team’s run ratio (which is equal to the theoretical team’s R/O divided by league R/O) and the batter’s relative R+/O+:

Rel R+/O+ = (TT_RR – 1)*9 + 1

So our Relative Adjusted R+/O+ can be calculated as:

RelAdj R+/O+ = (AdjTT_RR - 1)*9 + 1

I brought back our original relative R+/O+ (prior to going through the win-equivalent math) for comparison. Thomas gains slightly and Robinson loses more, because the value of his relative runs is lower in a high scoring environment. This is a similar conclusion to what we saw when comparing relative R+/PA and the relative adjusted R+/PA for Robinson and Thomas. Nominal runs are more valuable when the run scoring environment is lower, because it takes fewer marginal runs to create a marginal win. Relative runs are more valuable when the scoring environment is higher, because the win ratio expected to result from a given run ratio increases due to the higher Pythagenpat exponent.

At this point, we have exhausted my thoughts and ideas concerning the theoretical issues in designing individual batter rate stats. Next time I will discuss mixing up our rate stats and the frameworks within which I assert each should ideally be used.

Tuesday, October 05, 2021

End of Season Statistics, 2021

While this edition of End of Season Statistics will more closely resemble the reports I published through 2019 than the 2020 edition did, there are still a number of issues created by the revised rules, particularly the extra innings rule and seven-inning doubleheaders. Seeing as that both of these changes could be walked back for 2022, I have not attempted to revise my approach to take them into account – should they become permanent, then and only then will I invest time in trying to make the necessary adjustments (some of which I outlined here) to fit the data they produce within traditional sabermetric structures.

In the mean time, there will be three main consequences of the rules:

1) While I will provide relief pitcher statistics this year, I will base the value metrics on eRA rather than RA or ERA. RA is hopelessly polluted by the Manfred runners, while ERA is hopelessly polluted by virtue of being ERA. While I would prefer to base the value metric on a measure that reflects runs actually allowed, they are messy enough to begin with in the case of relievers that I am not too concerned about it.

2) When computing value metrics for starting pitchers, I will be comparing their RRA to the estimated league average RA rather than the actual one, since the latter is polluted by Manfred runners even though starters’ statistics themselves are immune from the impact.

3) Team run per game metrics will be expressed per 9 innings (27 outs), and a per 9 innings approach will be used to calculate expected winning percentages. As such, these will not exactly be an attempt to estimate what any given team’s W% should have been, but rather a theoretical estimate of what their W% would have been had they been playing under normal rules. For actual runs and runs allowed, these will still be distorted by Manfred runners, but accounting for that is again much more trouble than a (hopefully) two-year interlude justifies.

The data comes from a number of different sources. Most of the data comes from Baseball-Reference; I will try to note exceptions as they come up.

The basic philosophy behind these stats is to use the simplest methods that have acceptable accuracy. Of course, "acceptable" is in the eye of the beholder, namely me. I use Pythagenpat not because other run/win converters, like a constant RPW or a fixed exponent are not accurate enough for this purpose, but because it's mine and it would be kind of odd if I didn't use it.

If I seem to be a stickler for purity in my critiques of others' methods, I'd contend it is usually in a theoretical sense, not an input sense. So when I exclude hit batters, I'm not saying that hit batters are worthless or that they *should* be ignored; it's just easier not to mess with them and not that much less accurate (note: hit batters are actually included in the offensive statistics now).

I also don't really have a problem with people using sub-standard methods (say, Basic RC) as long as they acknowledge that they are sub-standard. If someone pretends that Basic RC doesn't undervalue walks or cause problems when applied to extreme individuals, I'll call them on it; if they explain its shortcomings but use it regardless, I accept that. Take these last three paragraphs as my acknowledgment that some of the statistics displayed here have shortcomings as well, and I've at least attempted to describe some of them in the discussion below.

The League spreadsheet is pretty straightforward--it includes league totals and averages for a number of categories, most or all of which are explained at appropriate junctures throughout this piece. The advent of interleague play has created two different sets of league totals--one for the offense of league teams and one for the defense of league teams. Before interleague play, these two were identical. I do not present both sets of totals (you can figure the defensive ones yourself from the team spreadsheet, if you desire), just those for the offenses. The exception is for the defense-specific statistics, like innings pitched and quality starts. The figures for those categories in the league report are for the defenses of the league's teams. However, I do include each league's breakdown of basic pitching stats between starters and relievers (denoted by "s" or "r" prefixes), and so summing those will yield the totals from the pitching side.

I added a column this year for “ActO”, which is actual (rather than estimated) outs made by the team offensively. This can be determined from the official statistics as PA – R – LOB. I have then replaced the column I usually show for league R/G (“N”) with R/9, which is actually R*27/ActO, which is equivalent to R*9/IP. This restates the league run average in the more familiar per nine innings. I’ve done the same for “OG”, which is Outs/Game but only for those outs I count in the individual hitter’s stats (AB – H + CS) ,“PA/G”, which is normally just (AB + W)/G, and “KG” and “WG” (normally just K/G and W/G) – these are now “O/9”, “PA/9”, still “KG”/”WG” and are per 27 actual outs.

The Team spreadsheet focuses on overall team performance--wins, losses, runs scored, runs allowed. The columns included are: Park Factor (PF), Winning Percentage (W%), Expected W% (EW%), Predicted W% (PW%), wins, losses, runs, runs allowed, Runs Created (RC), Runs Created Allowed (RCA), Home Winning Percentage (HW%), Road Winning Percentage (RW%) [exactly what they sound like--W% at home and on the road], R/9, RA/9, Runs Created/9 (RC/9), Runs Created Allowed/9 (RCA/9), and Runs Per Game (the average number of runs scored an allowed per game). For the offensive categories, runs/9 are based on runs per 27 actual outs; for pitching categories, they are runs/9 innings.

I based EW% and PW% on R/9 and RA/9 (and RC/9 and RCA/9) rather than the actual runs totals. This means that what they are not estimating what a team’s winning percentage should have been in the actual game constructions that they played, but what they should have been if playing nine inning games but scoring/allowing runs at the same rate per inning. EW%, which is based on actual R and RA, is also polluted by inflated runs in extra inning games; PW%, which is based on RC and RCA, doesn’t suffer from this distortion.

The runs and Runs Created figures are unadjusted, but the per-game averages are park-adjusted, except for RPG which is also raw. Runs Created and Runs Created Allowed are both based on a simple Base Runs formula. The formula is:

A = H + W - HR - CS

B = (2TB - H - 4HR + .05W + 1.5SB)*.76

C = AB - H

D = HR

Naturally, A*B/(B + C) + D.

Park factors are based on five years of data when applicable (so 2017 - 2021), include both runs scored and allowed, and they are regressed towards average (PF = 1), with the amount of regression varying based on the number of games in total in the sample. There are factors for both runs and home runs. The initial PF (not shown) is:

iPF = (H*T/(R*(T - 1) + H) + 1)/2

where H = RPG in home games, R = RPG in road games, T = # teams in league (14 for AL and 16 for NL). Then the iPF is converted to the PF by taking x*iPF + (1-x), where x = .1364*ln(G/162) + .5866. I will expound upon how this formula was derived in a future post.

It is important to note, since there always seems to be confusion about this, that these park factors already incorporate the fact that the average player plays 50% on the road and 50% at home. That is what the adding one and dividing by 2 in the iPF is all about. So if I list Fenway Park with a 1.02 PF, that means that it actually increases RPG by 4%.

In the calculation of the PFs, I did not take out “home” games that were actually at neutral sites (of which there were a rash in 2020). The Blue Jays multiple homes make things very messy, so I just used their 2021 data only.

There are also Team Offense and Defense spreadsheets. These include the following categories:

Team offense: Plate Appearances, Batting Average (BA), On Base Average (OBA), Slugging Average (SLG), Secondary Average (SEC), Walks and Hit Batters per At Bat (WAB), Isolated Power (SLG - BA), R/G at home (hR/G), and R/G on the road (rR/G) BA, OBA, SLG, WAB, and ISO are park-adjusted by dividing by the square root of park factor (or the equivalent; WAB = (OBA - BA)/(1 - OBA), ISO = SLG - BA, and SEC = WAB + ISO).

Team defense: Innings Pitched, BA, OBA, SLG, Innings per Start (IP/S), Starter's eRA (seRA), Reliever's eRA (reRA), Quality Start Percentage (QS%), RA/G at home (hRA/G), RA/G on the road (rRA/G), Battery Mishap Rate (BMR), Modified Fielding Average (mFA), and Defensive Efficiency Record (DER). BA, OBA, and SLG are park-adjusted by dividing by the square root of PF; seRA and reRA are divided by PF.

The three fielding metrics I've included are limited it only to metrics that a) I can calculate myself and b) are based on the basic available data, not specialized PBP data. The three metrics are explained in this post, but here are quick descriptions of each:

1) BMR--wild pitches and passed balls per 100 baserunners = (WP + PB)/(H + W - HR)*100

2) mFA--fielding average removing strikeouts and assists = (PO - K)/(PO - K + E)

3) DER--the Bill James classic, using only the PA-based estimate of plays made. Based on a suggestion by Terpsfan101, I've tweaked the error coefficient. Plays Made = PA - K - H - W - HR - HB - .64E and DER = PM/(PM + H - HR + .64E)

Next are the individual player reports. I defined a starting pitcher as one with 15 or more starts. All other pitchers are eligible to be included as a reliever. If a pitcher has 40 appearances, then they are included. Additionally, if a pitcher has 50 innings and less than 50% of his appearances are starts, he is also included as a reliever (this allows some swingmen type pitchers who wouldn’t meet either the minimum start or appearance standards to get in). This would be a good point to note that I didn't do much to adjust for the opener--I made some judgment calls (very haphazard judgment calls) on which bucket to throw some pitchers in. This is something that I should definitely give some more thought to in coming years.

For all of the player reports, ages are based on simply subtracting their year of birth from 2021. I realize that this is not compatible with how ages are usually listed and so “Age 27” doesn’t necessarily correspond to age 27 as I list it, but it makes everything a heckuva lot easier, and I am more interested in comparing the ages of the players to their contemporaries than fitting them into historical studies, and for the former application it makes very little difference. The "R" category records rookie status with a "R" for rookies and a blank for everyone else. Also, all players are counted as being on the team with whom they played/pitched (IP or PA as appropriate) the most.

For relievers, the categories listed are: Games, Innings Pitched, estimated Plate Appearances (PA), Run Average (RA), Relief Run Average (RRA), Earned Run Average (ERA), Estimated Run Average (eRA), DIPS Run Average (dRA), Strikeouts per Game (KG), Walks per Game (WG), Guess-Future (G-F), Inherited Runners per Game (IR/G), Batting Average on Balls in Play (%H), Runs Above Average (RAA), and Runs Above Replacement (RAR).

IR/G is per relief appearance (G - GS); it is an interesting thing to look at, I think, in lieu of actual leverage data. You can see which closers come in with runners on base, and which are used nearly exclusively to start innings. Of course, you can’t infer too much; there are bad relievers who come in with a lot of people on base, not because they are being used in high leverage situations, but because they are long men being used in low-leverage situations already out of hand.

For starting pitchers, the columns are: Wins, Losses, Innings Pitched, Estimated Plate Appearances (PA), RA, RRA, ERA, eRA, dRA, KG, WG, G-F, %H, Pitches/Start (P/S), Quality Start Percentage (QS%), RAA, and RAR. RA and ERA you know--R*9/IP or ER*9/IP, park-adjusted by dividing by PF. The formulas for eRA and dRA are based on the same Base Runs equation and they estimate RA, not ERA.

* eRA is based on the actual results allowed by the pitcher (hits, doubles, home runs, walks, strikeouts, etc.). It is park-adjusted by dividing by PF.

* dRA is the classic DIPS-style RA, assuming that the pitcher allows a league average %H, and that his hits in play have a league-average S/D/T split. It is park-adjusted by dividing by PF.

The formula for eRA is:

A = H + W - HR

B = (2*TB - H - 4*HR + .05*W)*.78

C = AB - H = K + (3*IP - K)*x (where x is figured as described below for PA estimation and is typically around .93) = PA (from below) - H - W

eRA = (A*B/(B + C) + HR)*9/IP

To figure dRA, you first need the estimate of PA described below. Then you calculate W, K, and HR per PA (call these %W, %K, and %HR). Percentage of balls in play (BIP%) = 1 - %W - %K - %HR. This is used to calculate the DIPS-friendly estimate of %H (H per PA) as e%H = Lg%H*BIP%.

Now everything has a common denominator of PA, so we can plug into Base Runs:

A = e%H + %W

B = (2*(z*e%H + 4*%HR) - e%H - 5*%HR + .05*%W)*.78

C = 1 - e%H - %W - %HR

cRA = (A*B/(B + C) + %HR)/C*a

z is the league average of total bases per non-HR hit (TB - 4*HR)/(H - HR), and a is the league average of (AB - H) per game.

Also shown are strikeout and walk rate, both expressed as per game. By game I mean not nine innings but rather the league average of PA/G. I have always been a proponent of using PA and not IP as the denominator for non-run pitching rates, and now the use of per PA rates is widespread. Usually these are expressed as K/PA and W/PA, or equivalently, percentage of PA with a strikeout or walk. I don’t believe that any site publishes these as K and W per equivalent game as I am here. This is not better than K%--it’s simply applying a scalar multiplier. I like it because it generally follows the same scale as the familiar K/9.

To facilitate this, I’ve finally corrected a flaw in the formula I use to estimate plate appearances for pitchers. Previously, I’ve done it the lazy way by not splitting strikeouts out from other outs. I am now using this formula to estimate PA (where PA = AB + W):

PA = K + (3*IP - K)*x + H + W

Where x = league average of (AB - H - K)/(3*IP - K)

Then KG = K*Lg(PA/G) and WG = W*Lg(PA/G).

G-F is a junk stat, included here out of habit because I've been including it for years. It was intended to give a quick read of a pitcher's expected performance in the next season, based on eRA and strikeout rate. Although the numbers vaguely resemble RAs, it's actually unitless. As a rule of thumb, anything under four is pretty good for a starter. G-F = 4.46 + .095(eRA) - .113(K*9/IP). It is a junk stat. JUNK STAT JUNK STAT JUNK STAT. Got it?

%H is BABIP, more or less--%H = (H - HR)/(PA - HR - K - W), where PA was estimated above. Pitches/Start includes all appearances, so I've counted relief appearances as one-half of a start (P/S = Pitches/(.5*G + .5*GS). QS% is just QS/(G - GS); I don't think it's particularly useful, but Doug's Stats include QS so I include it.

I've used a stat called Relief Run Average (RRA) in the past, based on Sky Andrecheck's article in the August 1999 By the Numbers; that one only used inherited runners, but I've revised it to include bequeathed runners as well, making it equally applicable to starters and relievers. One thing that's become more problematic as time goes on for calculating this expanded metric is the sketchy availability of bequeathed runner data for relievers. As a result, only bequeathed runners left by starters (and "relievers" when pitching as starters) are taken into account here. I use RRA as the building block for baselined value estimates for all pitchers. I explained RRA in this article, but the bottom line formulas are:

BRSV = BRS - BR*i*sqrt(PF)

IRSV = IR*i*sqrt(PF) - IRS

RRA = ((R - (BRSV + IRSV))*9/IP)/PF

Given the difficulties of looking at the league average of actual runs due to Manfred rules, I decided to use eRA to calculate the baselined metrics for relievers. So they are no longer based on actual runs allowed by the pitcher, but rather on the component statistics. For starters, I will use the actual runs allowed in the form of RRA, but compared to the league average eRA. Starters’ statistics are not influenced by the Manfred runners, but the league average RA is still artificially inflated by them, so the league eRA should be a better measure of what the league average RRA would be in lieu of Manfred runners. I say “should” as this assumes that the eRA formula is properly calibrated, and it’s hard to calibrate any runs created formula when you don’t know what the league average runs should be. I remain unconvinced that most saberemtricians have fully grasped all of the implications of the Manfred runners on the 2020-2021 statistics, and if these rules are maintained going forward it will require much more effort to maintain basic sabermetric measures. In any event, the RAA/RAR formulas I’m using are:

RAA (relievers) = (.951*Lg(eRA) - eRA)*IP/9

RAA (starters) = (1.025*Lg(eRA) - eRA)*IP/9

RAR (relievers) = (1.11*Lg(eRA) - RRA)*IP/9

RAR (starters) = (1.28*Lg(eRA) - RRA)*IP/9

All players with 250 or more plate appearances (official, total plate appearances) are included in the Hitters spreadsheets (along with some players close to the cutoff point who I was interested in). Each is assigned one position, the one at which they appeared in the most games. The statistics presented are: Games played (G), Plate Appearances (PA), Outs (O), Batting Average (BA), On Base Average (OBA), Slugging Average (SLG), Secondary Average (SEC), Runs Created (RC), Runs Created per Game (RG), Speed Score (SS), Hitting Runs Above Average (HRAA), Runs Above Average (RAA), and Runs Above Replacement (RAR).

Starting in 2015, I'm including hit batters in all related categories for hitters, so PA is now equal to AB + W+ HB. Outs are AB - H + CS. BA and SLG you know, but remember that without SF, OBA is just (H + W + HB)/(AB + W + HB). Secondary Average = (TB - H + W + HB)/AB = SLG - BA + (OBA - BA)/(1 - OBA). I have not included net steals as many people (and Bill James himself) do, but I have included HB which some do not.

BA, OBA, and SLG are park-adjusted by dividing by the square root of PF. This is an approximation, of course, but I'm satisfied that it works well (I plan to post a couple articles on this some time during the offseason). The goal here is to adjust for the win value of offensive events, not to quantify the exact park effect on the given rate. I use the BA/OBA/SLG-based formula to figure SEC, so it is park-adjusted as well.

Runs Created is actually Paul Johnson's ERP, more or less. Ideally, I would use a custom linear weights formula for the given league, but ERP is just so darn simple and close to the mark that it’s hard to pass up. I still use the term “RC” partially as a homage to Bill James (seriously, I really like and respect him even if I’ve said negative things about RC and Win Shares), and also because it is just a good term. I like the thought put in your head when you hear “creating” a run better than “producing”, “manufacturing”, “generating”, etc. to say nothing of names like “equivalent” or “extrapolated” runs. None of that is said to put down the creators of those methods--there just aren’t a lot of good, unique names available.

For 2015, I refined the formula a little bit to:

1. include hit batters at a value equal to that of a walk

2. value intentional walks at just half the value of a regular walk

3. recalibrate the multiplier based on the last ten major league seasons (2005-2014)

This revised RC = (TB + .8H + W + HB - .5IW + .7SB - CS - .3AB)*.310

RC is park adjusted by dividing by PF, making all of the value stats that follow park adjusted as well. RG, the Runs Created per Game rate, is RC/O*26. For a very long time, dating back to the Jamesian era, 25.5 has been a good approximation for the number of (AB – H + CS) per game, but it has been creeping up, and per 9 innings this year it was right around 26, so I am using that value now.

I do not believe that outs are the proper denominator for an individual rate stat, but I also do not believe that the distortions caused are that bad. (I still intend to finish my rate stat series and discuss all of the options in excruciating detail, but alas you’ll have to take my word for it now).

Several years ago I switched from using my own "Speed Unit" to a version of Bill James' Speed Score; of course, Speed Unit was inspired by Speed Score. I only use four of James' categories in figuring Speed Score. I actually like the construct of Speed Unit better as it was based on z-scores in the various categories (and amazingly a couple other sabermetricians did as well), but trying to keep the estimates of standard deviation for each of the categories appropriate was more trouble than it was worth.

Speed Score is the average of four components, which I'll call a, b, c, and d:

a = ((SB + 3)/(SB + CS + 7) - .4)*20

b = sqrt((SB + CS)/(S + W))*14.3

c = ((R - HR)/(H + W - HR) - .1)*25

d = T/(AB - HR - K)*450

James actually uses a sliding scale for the triples component, but it strikes me as needlessly complex and so I've streamlined it. He looks at two years of data, which makes sense for a gauge that is attempting to capture talent and not performance, but using multiple years of data would be contradictory to the guiding principles behind this set of reports (namely, simplicity. Or laziness. You're pick.) I also changed some of his division to mathematically equivalent multiplications.

The baselined stats are calculated in the same basic manner the pitcher stats are, using the league average RG:

HRAA = (RG – LgRG)*O/26

RAA = (RG – LgRG*PADJ)*O/26

RAR = (RG – LgRG*PADJ*.73)*O/26

PADJ is the position adjustment, based on 2010-2019 offensive data. For catchers it is .92; for 1B/DH, 1.14; for 2B, .99; for 3B, 1.07; for SS, .95; for LF/RF, 1.09; and for CF, 1.05. As positional flexibility takes hold, fielding value is better quantified, and the long-term evolution of the game continues, it's right to question whether offensive positional adjustments are even less reflective of what we are trying to account for than they were in the past. But while I do not claim that the relationship is or should be perfect, at the level of talent filtering that exists to select major leaguers, there should be an inverse relationship between offensive performance by position and the defensive responsibilities of the position. Not a perfect one, but a relationship nonetheless. An offensive positional adjustment than allows for a more objective approach to setting a position adjustment. Again, I have to clarify that I don’t think subjectivity in metric design is a bad thing - any metric, unless it’s simply expressing some fundamental baseball quantity or rate (e.g. “home runs” or “on base average”) is going to involve some subjectivity in design (e.g linear or multiplicative run estimator, any myriad of different ways to design park factors, whether to include a category like sacrifice flies that is more teammate-dependent).

The replacement levels I have used here are very much in line with the values used by other sabermetricians. This is based both on my own "research", my interpretation of other's people research, and a desire to not stray from consensus and make the values unhelpful to the majority of people who may encounter them.

Replacement level is certainly not settled science. There is always going to be room to disagree on what the baseline should be. Even if you agree it should be "replacement level", any estimate of where it should be set is just that--an estimate. Average is clean and fairly straightforward, even if its utility is questionable; replacement level is inherently messy. So I offer the average baseline as well.

For position players, replacement level is set at 73% of the positional average RG (since there's a history of discussing replacement level in terms of winning percentages, this is roughly equivalent to .350). For starting pitchers, it is set at 128% of the league average RA (.380), and for relievers it is set at 111% (.450).

The spreadsheets are published as Google Spreadsheets, which you can download in Excel format by changing the extension in the address from "=html" to "=xlsx", or in open format as "=ods", or in csv as "=csv". That way you can download them and manipulate things however you see fit.

Monday, October 04, 2021

Crude Playoff Odds--2021

These are very simple playoff odds, based on my crude rating system for teams using an equal mix of W%, EW% (based on R/RA), PW% (based on RC/RCA), and 69 games of .500. They account for home field advantage by assuming a .500 team wins 54.2% of home games (major league average 2006-2015). They assume that a team's inherent strength is constant from game-to-game. They do not generally account for any number of factors that you would actually want to account for if you were serious about this, including but not limited to injuries, the current construction of the team rather than the aggregate seasonal performance, pitching rotations, estimated true talent of the players, etc.

The CTRs that are fed in are:

Wildcard game odds (the least useful since the pitching matchups aren’t taken into account, and that matters most when there is just one game):

DS:

LCS:

World Series:

Everything combined:

If the Dodgers win the wildcard game, they become the World Series favorites at 19.9%, with the Giants falling to 19.0%; the Rays fall to 16.1%, so the Dodgers don't have a huge impact on the AL odds (the Dodgers are given a 32.6% to win the pennant should they win the wildcard game). If the Cardinals win, the Giants jump to 25.9% and the Rays to 17.6% (of course the Giants benefit greatly because they have an estimated 68% chance to beat the Cardinals in the NLDS but only 50% to beat the Dodgers). Ranging into the realm of the subjective, I personally favor Houston to win the AL pennant and think Milwaukee will benefit from concentrating innings in front-line pitchers (even sans Devin Williams) and from being on the opposite side of the bracket from the NL West.

I don't have the energy for a rant about what a preposterous proposition it is to make the Dodgers play the Cardinals, or about how the much-hyped four-way battle for the AL wildcard yesterday would never happen under Rob Manfred's desired system (or how if it did it would be between teams struggling to reach .500). The playoffs simultaneously manage to be one of the best and worst things about baseball, and every expansion will serve to enhance the latter.

Walk Like a Sabermetrician

Wednesday, October 20, 2021

Rate Stat Series, pt. 14: Relativity for the Theoretical Team Framework

Tuesday, October 05, 2021

End of Season Statistics, 2021

Monday, October 04, 2021

Crude Playoff Odds--2021

Me, Elsewhere

Analysis Links

Reference Links

Blog Archive

OSU Baseball

End of Season Statistics

Win Shares Walkthrough

NL 1876-1881 Series

Labels

About Me