What can we predict with MiLB Numbers?

Jackson Del Rosario
Dec 11, 2021
Nick Pratto — MLB

If you’ve ever looked into Minor League Baseball numbers, you’ll notice they’re a bit incomplete. They have all the fun surface-level stats we look for but are missing the important bits: no batted-ball data, few plate-discipline numbers, and nothing on pitchers (that I care about). Of course, that is only what is available to the public; the pro teams have the data, but it kinda sucks for us independent folk. I would love to see, say, Nick Yorke’s 90th-percentile exit velo, but oh well.

So let's work with what we do have. How accurately can we predict Major League success from pure Minor League Results Data? The answer is just barely, but let's go into why:

Throughout this article, I’ll show my R code for this model, and I’ll link to the GitHub repo with the data sources at the end. I’m doing this for two reasons. First, if you see glaring errors in my code or ways I can improve, please do me a solid and tell me! Second, I want to help people just getting into coding with some free code to jump off of!

Methodology:

We start by downloading all the minor league data that is helpful to us from FanGraphs. I took the data from 2009–2021 (omitting 2020, of course, because there was no MiLB season). You have to download two different CSVs, one for the basic stats and one for the advanced stats. I then filter by age and minimum plate appearances and make the ‘Level’ variable numeric:

# Packages ----
# library() stops with an error if a package is missing; require() only warns
library(tidyverse)
library(xgboost)
library(caTools)
##########################
#Download Your Data Here!#
##########################
# Basic MiLB Stats
milb$Level[milb$Level == "DSL"] <- 1
milb$Level[milb$Level == "CPX"] <- 2
milb$Level[milb$Level == "R"] <- 3
milb$Level[milb$Level == "A-"] <- 4
milb$Level[milb$Level == "A"] <- 5
milb$Level[milb$Level == "A+"] <- 6
milb$Level[milb$Level == "AA"] <- 7
milb$Level[milb$Level == "AAA"] <- 8
# Advanced MiLB Stats
milb2$Level[milb2$Level == "DSL"] <- 1
milb2$Level[milb2$Level == "CPX"] <- 2
milb2$Level[milb2$Level == "R"] <- 3
milb2$Level[milb2$Level == "A-"] <- 4
milb2$Level[milb2$Level == "A"] <- 5
milb2$Level[milb2$Level == "A+"] <- 6
milb2$Level[milb2$Level == "AA"] <- 7
milb2$Level[milb2$Level == "AAA"] <- 8
# Filtering Basic
mistats1 <- milb %>%
filter(PA >= 150) %>%
filter(Age < 26) %>%
mutate(BB = BB. * 100,
K = K. * 100,
K_BB = K-BB, # K Minus BB %
wRC_plus = wRC.) %>%
select(Name, Level, Age, Season, PA, K_BB, ISO, Spd, wRC.)
# Filtering Advanced
mistats2 <- milb2 %>%
filter(Age < 26) %>%
filter(PA >= 150) %>%
select(Name, Level, Age, Season, PA, SwStr., BABIP)
#Combine
mistats <- left_join(mistats1, mistats2, by = c("Name", "Level", "PA", "Season", "Age"))

This gives us a data frame of all the MiLB stats we want, with everything besides Name holding numbers, which is important for later. (Level is still stored as text at this point and gets converted with as.numeric further down.)

Also, I recommend using the code above as a reference for the Level variable, since until the end it will be treated as 1–8.
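
If the sixteen recoding lines feel repetitive, the same mapping can be sketched with a named lookup vector. This is only an illustration in base R: the `level_codes` name and the toy rows are my own assumptions, not part of the original script, and unlike the original it produces a numeric column right away.

```r
# Hypothetical lookup table; the codes mirror the recoding used in the article.
level_codes <- c("DSL" = 1, "CPX" = 2, "R" = 3, "A-" = 4,
                 "A" = 5, "A+" = 6, "AA" = 7, "AAA" = 8)

# Toy stand-in for the FanGraphs export (made-up rows).
toy <- data.frame(Name = c("Player A", "Player B", "Player C"),
                  Level = c("AA", "A+", "DSL"),
                  stringsAsFactors = FALSE)

# Named-vector indexing maps every label in one step; unknown labels become NA.
toy$LevelNum <- unname(level_codes[toy$Level])

# The reverse mapping reuses the names of the same vector.
toy$LevelBack <- names(level_codes)[match(toy$LevelNum, level_codes)]
```

Keeping one lookup vector also means the conversion back to labels at the end cannot drift out of sync with the forward mapping.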

We’ll then do the same thing for big league data since 2017 (min. 400 PAs). I felt that time frame allowed the most MiLB players to make it to, and remain in, the big leagues, while also giving a broad enough sample to capture true talent.

Again we download from FanGraphs and manipulate using the following:

mlbstats <- mlb %>% 
filter(PA >= 400) %>%
mutate(Name = ï..Name) %>% # Only run this if you encode and it comes out like this
mutate(off = (Off / PA) * 100) %>%
select(Name, off)

This also gives us our comparison variable, the FanGraphs Stat “Off”, which you can read about here. It is basically offensive runs above average. I turned it into a rate stat, dividing it by Plate Appearances, and then multiplying by 100 so it's a little easier to work with.
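
To make the rate conversion concrete, here is a tiny worked example in base R. The two hitters and their totals are invented purely for illustration.

```r
# Made-up season totals: Off is runs above average, PA is plate appearances.
toy_mlb <- data.frame(Name = c("Hitter A", "Hitter B"),
                      Off = c(30, -5),
                      PA = c(600, 450))

# Off per 100 plate appearances, the same formula as in the mutate() call.
toy_mlb$off <- toy_mlb$Off / toy_mlb$PA * 100
```

Hitter A comes out at 5 runs above average per 100 PA and Hitter B at about -1.1, so players with different playing time land on the same scale.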

The reason I did not use wRC+ is that it lacks baserunning and for many players that is the sole reason they even get called up and how they accrue much of their value.

We then join the MiLB and MLB numbers into one ‘Mega Data Frame’

stats <- left_join(mistats, mlbstats, by = c("Name"))

We then come to the problem with no answer: how do we value players who never reached the big leagues? I’d love to hear other input on this question, but I chose to give each of them the 10th-percentile value of offensive runs above average. Again, I’m very open here. This is done through:

non_mlb <- quantile(stats$off, probs = 0.1, na.rm = TRUE)
stats$off <- ifelse(is.na(stats$off), non_mlb, stats$off)
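
Here is the same imputation idea on a toy vector, as a rough sketch. The values are made up, and the `reached_mlb` flag is a hypothetical extra column (not in the original script) that keeps track of which rows were filled in so they can be excluded later.

```r
# Toy stand-in for the joined data; NA in off means "never reached MLB" (made-up values).
toy <- data.frame(Name = paste("Player", 1:6),
                  off = c(-2, NA, 0, 1, NA, 3))

# Hypothetical flag column, not in the article: remember which rows are real.
toy$reached_mlb <- !is.na(toy$off)

# 10th percentile of the observed values (R's default type-7 quantile).
non_mlb <- unname(quantile(toy$off, probs = 0.1, na.rm = TRUE))

# Fill only the missing rows with that default.
toy$off[!toy$reached_mlb] <- non_mlb
```

With a flag like this, plots or correlations "without the default values" are just a filter on `reached_mlb`.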

We then want to filter the 2017–2021 data out of our training set. We do this because most of those players have not hit the big leagues yet, and their 10th-percentile Off will skew the data.

stats <- stats %>% 
  filter(!Season %in% c(2017, 2018, 2019, 2021)) %>% # chaining != with | would keep every row
  select(-Season) # Don't need the Season var anymore
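
One thing worth double-checking when dropping several seasons: a chain of `!=` joined with `|` is TRUE for every row, because any season differs from at least one of the listed years. A small base-R sketch (the `seasons` vector is made up for illustration) shows the difference between that and `%in%`:

```r
# Made-up seasons to test the two filter conditions against.
seasons <- c(2015, 2016, 2017, 2018, 2019, 2021)
drop_years <- c(2017, 2018, 2019, 2021)

# OR of inequalities: every season is "not 2019" or "not 2021", so all rows pass.
keep_or <- seasons != 2019 | seasons != 2021 | seasons != 2018 | seasons != 2017

# Negated membership test: TRUE only for seasons outside drop_years.
keep_in <- !(seasons %in% drop_years)
```

Here `keep_or` is TRUE everywhere, while `seasons[keep_in]` leaves only 2015 and 2016.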

Then for some extra cleaning, we need to make sure everything is numeric and free of NAs:

# Making Everything Numeric
stats$Level <- as.numeric(stats$Level)
# Don't Want Missing Values
stats <- stats %>%
select(-Name) %>%
filter(!is.na(Level),
!is.na(PA),
!is.na(K_BB),
!is.na(ISO),
!is.na(Spd),
!is.na(wRC.), #wRC+
!is.na(SwStr.),
!is.na(off),
!is.na(BABIP))

Now that the data is clean, prepped, and ready we can make a model:

For our model, we will be using my favorite Machine Learning Model, XGBoost. It is fast, reliable, and easy to tune, so I use it for most things nowadays.

First off is splitting the data into training and test sets with an 80/20 split:

# Make it Reproducible
set.seed(10)
sample = sample.split(stats$off, SplitRatio = .8)
train = subset(stats, sample == TRUE)
test = subset(stats, sample == FALSE)
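
For anyone without caTools installed, a plain random 80/20 split can be sketched in base R. This is a swapped-in alternative, not what the script above does (sample.split also tries to preserve the distribution of the target), and the toy data frame is invented.

```r
set.seed(10)  # same idea as above: fix the seed so the split is repeatable

# Made-up data standing in for the cleaned stats frame.
n <- 50
toy <- data.frame(x = rnorm(n), off = rnorm(n))

# Draw 80% of the row indices at random for training; the rest is the test set.
train_idx <- sample(n, size = round(0.8 * n))
toy_train <- toy[train_idx, ]
toy_test <- toy[-train_idx, ]
```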

Then we’ll make them fit for XGBoost-ing by forming them into matrices:

dtrain <- xgb.DMatrix(data = as.matrix(train[,-10]), label = train$off)
dtest <- xgb.DMatrix(data = as.matrix(test[,-10]), label = test$off)

Then one of the most important parts, the parameters.

If you have any advice for tuning these in a different fashion or really anything do not be shy. If I made a critical error here, please let me know :)

params <- list(booster = "gbtree", eta=0.01, gamma=5, max_depth=4, 
min_child_weight=2, subsample=0.3, colsample_bytree=0.4,
tree_method = 'hist', scale_pos_weight = 1)

This data was very prone to overfitting so I really had to try hard to avoid that here, but I believe I did a semi-decent job.

Then we can run the model. Since the trees are shallow (max_depth) and the learning rate (eta) is low, the number of rounds is quite high:

xgb1 <- xgb.train(params = params, data = dtrain, nrounds = 1000, 
watchlist = list(val=dtest,train=dtrain))

This gives us an ending RMSE of

[1000] val-rmse:1.191388 train-rmse:1.093189

Compared with the standard deviation of Off,

[1] 1.266633

that is somewhat better than always guessing the mean. But I care less about whether it matches the exact offensive output and more about whether it simply correlates well, which is hard to measure because of the players who did not make it to the big leagues.

If we simply take the overall R², it comes to about 0.24, which isn’t fantastic, but we also added thousands of default values, so it is a bit deceiving.
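
To see why the standard deviation is a sensible yardstick for RMSE, here is a small sketch: a model that always predicts the mean has an RMSE almost equal to the SD. The toy values are made up. The last lines plug in the two numbers quoted above, under my assumption (not stated in the article) that both were computed on the same rows.

```r
# Toy target values (made up, not the article's data).
y <- c(-1.5, -0.5, 0.0, 0.5, 1.5, 2.0)
n <- length(y)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Baseline model: always predict the mean of y.
baseline_rmse <- rmse(y, mean(y))

# sd() divides by n - 1, so it is slightly larger than the baseline RMSE.
sd_y <- sd(y)

# Share of variance explained implied by the article's validation RMSE and SD
# (assumes both refer to the same rows).
val_rmse <- 1.191388
off_sd <- 1.266633
explained <- 1 - (val_rmse / off_sd)^2
```

On those two numbers `explained` works out to roughly 0.12. That need not match the overall R² quoted above, which may have been computed on a different set of rows.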

Here is what the scatter plot looks like without those values

In my opinion it is still overfitting some, but not to an egregious point, and it is no longer underfitting either.

XGBoost is also kind enough to provide an importance matrix:

Importance Matrix

This is about what I would expect: total offense at the top, with where they are playing being the second most important. K-BB being that low surprises me a little, but really all the values are quite close.

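Now we predict and add the predictions back to the data frame. I simply did:

# NOTE: mistats_final is not created in the snippets shown here; it is assumed to
# have the same column layout as train/test (column 10 = off).
pred <- predict(xgb1, newdata = as.matrix(mistats_final[,-10]))
pred2 <- as.data.frame(pred)
# Adding it back
final <- mistats_final %>%
  mutate(xOff = pred2$pred)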

And then I joined it back with the original data through

mistats$Level <- as.numeric(mistats$Level)
final <- left_join(final, mistats, by = c("Level",
                                          "PA", "K_BB",
                                          "ISO", "Spd", "wRC.",
                                          "SwStr.", "BABIP", "Age"))

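This gives us their xOff, but that number is wonky and has no intuitive scale, so we’ll only look at it in terms of xOff+, where, like any plus stat, 100 is average and 110 is 10% better than average. (Strictly speaking, the formula shifts xOff by its mean and multiplies by 100, so each point is 0.01 of xOff rather than a literal 1%.)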

# The prediction column was created as xOff, so use the same spelling here
mean_xoff <- mean(final$xOff)
final <- final %>% 
  select(Name, Age, Season, Level, xOff) %>%
  mutate(plus = round((xOff - mean_xoff) * 100 + 100, 0)) %>%
  select(-xOff)

Yes I know my variable naming is bad
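
As a quick sanity check on the scaling, here is the plus formula applied to four made-up xOff values. The numbers are invented; only the formula comes from the code above.

```r
# Made-up predictions on the xOff scale.
x_off <- c(-0.5, 0, 0.5, 1)

# Center on the mean, stretch by 100, and shift so the average lands on 100.
plus <- round((x_off - mean(x_off)) * 100 + 100, 0)
```

The mean here is 0.25, so the four values become 25, 75, 125, and 175. A gap of 0.5 in xOff turns into a gap of 50 points of xOff+.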

We’re almost home now, I just want to add some extra stuff to make it read better and have some more info.

# Grabbing Team Names
team <- milb %>%
select(Name, Age, Team, Season)
final <- left_join(final, team, by = c("Name", "Season", "Age"))
# Adding Percentiles
final <- final %>%
distinct() %>%
arrange(-plus) %>%
group_by(Level) %>%
mutate(pctl = ntile(plus, 100))
# Fixing Level
final$Level[final$Level == 1] <- "DSL"
final$Level[final$Level == 2] <- "CPX"
final$Level[final$Level == 3] <- "R"
final$Level[final$Level == 4] <- "A-"
final$Level[final$Level == 5] <- "A"
final$Level[final$Level == 6] <- "A+"
final$Level[final$Level == 7] <- "AA"
final$Level[final$Level == 8] <- "AAA"
# Add Some Logos
Team <- c("BOS", "NYY", "BAL", "TOR", "TBR",
"CLE", "DET", "KCR", "MIN", "CHW",
"HOU", "SEA", "OAK", "LAA", "TEX",
"ATL", "PHI", "NYM", "MIA", "WSN",
"MIL", "STL", "CIN", "CHC", "PIT",
"SFG", "LAD", "SDP", "COL", "ARI", "FLA", "TBD")
logo <- c("https://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/bos.png",
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/nyy.png",
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/bal.png",
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/tor.png",
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/tb.png",
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/cle.png",
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/det.png",
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/kc.png",
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/min.png",
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/chw.png",
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/hou.png",
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/sea.png",
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/oak.png",
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/laa.png",
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/tex.png",
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/atl.png",
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/phi.png",
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/nym.png",
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/mia.png",
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/wsh.png",
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/mil.png",
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/stl.png",
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/cin.png",
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/chc.png",
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/pit.png",
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/sf.png",
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/lad.png",
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/sd.png",
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/col.png",
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/ari.png",
"https://cdn.freebiesupply.com/logos/large/2x/florida-marlins-1-logo-png-transparent.png",
"https://cdn.freebiesupply.com/logos/large/2x/tampa-bay-devil-rays-5-logo-png-transparent.png")
logo_add <- data.frame(Team, logo)
final <- left_join(final, logo_add, by = "Team")

This rather large chunk of code adds the team names and logo links for later use. It also turns Level back into the labels we are used to, so there’s no 1–8 anymore.
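
The within-level percentile idea can also be sketched without dplyr. Note this swaps in a rank-based percentile from base R rather than ntile(), so the exact values can differ from the ones in the script above. The toy rows are made up.

```r
# Made-up xOff+ values at two levels.
toy <- data.frame(Level = c("AA", "AA", "AA", "AA", "AAA", "AAA"),
                  plus = c(100, 150, 200, 250, 90, 110))

# Percentile of each row within its own level: rank divided by group size.
pct_rank <- function(v) ceiling(100 * rank(v, ties.method = "min") / length(v))
toy$pctl <- ave(toy$plus, toy$Level, FUN = pct_rank)
```

Because ave() applies the function separately to each level, the best hitter at every level ends up at 100 no matter how their raw xOff+ compares across levels.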

Lastly, let's save the model so we can reuse it:

saveRDS(xgb1, file = 'milb_model.RDS')

Results:

Now we can discuss the findings. I already showed the correlation and plot above, but let's take a deeper look, and then some individual looks as well.

Here’s a handy chart to know the tiers of this stat. I hope you can figure out percentiles on your own though :)

Of course this varies by level, but overall, along with the percentiles given, it should be enough to make a proper evaluation of what it means.

Here are the top 25 since ‘09:

You’ll notice some very high values, suggesting that Mookie Betts was 553% better than average, yet that may be true, because the average here isn’t the average major leaguer but the average minor leaguer, most of whom never reach the show.

The names also suggest some overfitting, which I’m still tinkering with, but I like where it is as of now.

Now here’s the best in 2021:

Keibert Ruiz, Tucupita Marcano, and many others are there twice because they switched teams. If a row has the same xOff+, name, year, and level, just assume it is a repeat.

If you are confused by the percentiles, they are done by level, so a 288 xOff+ in A+ ball is in the highest percentile, while in AAA it’s still quite good, but not all the way there.

As seen in the importance matrix, and here, the driving factor really is whether you can hit well in the minors:

Frankly, if you cannot hit well vs. A+ pitching, how are you going to hit well vs. MLB pitching?

Some other notes:

Within the top 5% of xOff+ overall, the average Off is 60% better than the average MiLB batter. Within the top 10%, the Off is 40% better. Within the top 25%, it is about 3% better.

So it is clear that when a batter reaches that upper echelon of MiLB results, they have a significantly better chance of both making the big leagues and being a successful offensive player.

Else:

Here are the top 15 wRC+ leaders in 2021 (minus Ohtani, duh):

Besides B-Craw, a 17-year-old Fernando Tatis Jr., and a 20/21-year-old Marcus Semien, all these guys range in the great-to-elite categories here.

That’s about all I have for you:

If you want to see individual players or certain categories, let me know on Twitter. If you have questions or comments, tweet me or DM me! I love constructive criticism, so please.

Here’s the link to download the data!

Follow me on Twitter: @JacksonDels2

On Instagram: @jacksondels

And have a wonderful day
