Data and Analytics Resources

Provide Confident Assurance to Your Organization

Using Linear Regression to Predict a Pitcher's Performance

by Alex Hagen

Nov 09, 2017

For better or worse, the game of baseball has changed drastically in just the last few years.  “Small ball” (an approach to baseball that involves base hits, sac flys, bunts) is dying.  Players are swinging for the fences every chance they get and thus, are striking out at a higher rate than ever before. Balls are being put in play less and less.

Why is this happening? The use of real-time data is challenging players to push the boundaries of athletic performance. In 2007, MLB introduced StatCast which gives teams access to data never imagined before. Over the last 10 years, we’ve seen baseball transform from a competitive team sport to a more individualized performance sport driven by the desire to improve one’s own statistics. A great show of sportsmanship is now more about the performance of the individual than the W or L of the team.

This “analytics revolution” in baseball is just beginning.

  • Analysts are working to predict injuries to pitchers just by looking at minute differences in pitch velocity and spin rate.
  • Batters finish their “at bat” and go back to the dugout and check what their “launch angle” and “exit velocity” were.
  • Outfielders check the piece of paper in their back pocket to see exactly where they need to stand for the current batter based on that batter’s spray chart.
  • “The shift” is a new concept that baseball traditionalists scoff at.  Infielders move from their normal position and move the other side of the infield because the data says that that is where the ball is most likely to go.
  • Pitchers are being trained to throw the ball harder and harder than ever before.

Don’t believe me? Let’s let the data speak for itself. Does the data support this new trend of pitchers throwing harder than ever before? We will use simple linear regression and compare the ERA of pitchers in 2016 to their average fastball speed.

For those who do not know, ERA is a commonly used statistic for pitchers that stands for Earned Run Average.  It is the number of runs scored against that pitcher per nine innings pitched. The lower the ERA the better.

What is Linear Regression? 





















A linear regression involves plotting a line that best represents a scatterplot of points, like the one above. The line of best fit is the line that minimizes the total squared distance from each point to the line. You can use the equation of the line to predict future point values.  For example, if X were 7.3 we can use the equation to predict the value of Y.  Y = 3.0273*(7.3) + 2.0106 = 24.10989.

Another important value to note in a linear regression model is correlation. Correlation is a value between -1 and 1 that describes how well the scatterplot fits to a line. Positive correlation means as X increases, so does Y. Negative means as X increases, Y decreases.  The closer correlation is to 1 or -1 then the better the data fits to a line.  A correlation of 0 means there is no correlation.  The correlation for this example is 0.997688. 

Gathering the data

We will use two different data sources for this pitch speed analysis. The first is called the Lahman Database. Run by a man named Sean Lahman, the Lahman database has complete end of season stats going all the way back to 1871. At you will find a Microsoft Access version, a CSV version, and a SQL version.  For this demo we will be using the CSV “Pitching” table from the database.  You’ll find a description of all the columns here:


Small portion of the “Pitching” table from the Lahman Database

The Lahman database covers a lot of different stats but it gives you no way to explain “why” players are or are not having success. This is where’s PitchF/x data comes into play.  As of the 2008 season, we have data for every pitch thrown including the type of pitch, where it crossed the plate, the speed, how much break it had and in what direction, and much more. is a great resource describing what all the columns from the pitchF/x data mean.


Example of what the raw data looks like from MLB’s website.

This data powers the MLB At Bat application.  The picture above is all the data from an at bat by Evan Longoria of the Rays with Danny Barnes of the Blue Jays pitching.  At the top, you’ll see the result of the at bat (“des”) which was “Evan Longoria walks. Brad Miller to 2nd.”  I’ve highlighted a few key stats from the penultimate pitch thrown to Longoria.  You can see there that type = “B”(ball), pitch_type = “CH” (changeup), and start_speed = 80.2 (80 mph when released).


Barnes throws an 80 MPH Changeup for a ball to Evan Longoria. Screenshot taken from MLB’s At Bat application.

The next step will be getting the PitchF/x data, which is an XML file, into a form that we can digest in R. Then we will join these databases together so we can look at pitcher’s fastball speed and compare this to their ERA, both for the 2016 season. As a reminder, the lower the ERA, the better.

Massaging the data

Finally, time to start using some R code.

Using the “pitchRx” package and the “scrape” command we can pull pitch data from the MLB website for specific games or for a date range into a nice data frame that is easy to use.  You’ll find all documentation for the “pitchRx” package here:

GAME <- scrape(game.ids = "gid_2017_07_19_tbamlb_oakmlb_1")
GAMES_April <- scrape("2016-04-03","2016-04-15")


This gives you 5 tables. The best way to utilize the results are to combine the pitch table with the atbat table. This way you have every pitch that was thrown as well as the results of the at-bat.

Pitches_April <- plyr::join(GAMES_April$atbat, GAMES_April$pitch, by=c("num", "url"), type="inner")


Unfortunately, the scrape command has a limit of 200 games per use.  To get around this we’ll simply use the “scrape” command multiple times and combined the results into one table.

GAMES_April_2 <- scrape("2016-04-16","2016-04-28")
Pitches_April_2 <- plyr::join(GAMES_April_2$atbat, GAMES_April_2$pitch, by=c("num", "url"), type="inner")
Pitches_All <- plyr::join(Pitches_April, Pitches_April_2, type = "full")


Rinse and repeat.

Now we need to join this table with the pitching table from the Lahman database.  Unfortunately, there is no direct way to join these tables together.  MLB uses unique numbers to identify pitchers and Lahman uses a playerID column.  The Lahman database has a “master” table for this purpose but the master table does not have the MLBcode. has a table with both playerID and MLBcode but this table is incomplete.
To join the tables, we used Lahman’s “Master” table and made a new “pitcher_name” column and joined the tables using the pitcher’s full names.

setwd(“#MY FOLDER#")
Pitching_Stats <- read.csv("pitching.csv"
Pitching_Stats <- subset(Pitching_Stats, Pitching_Stats$yearID == 2016)
Master <-  read.csv("master.csv")
Master <- subset(Master, select = c("playerID","nameFirst","nameLast"))
Master$pitcher_name <- paste(Master$nameFirst, Master$nameLast, sep=" ")
Pitching_Stats_with_names <- plyr::join(Pitching_Stats, Master, by= "playerID", type="inner")
Pitches_All_and_Stats <- plyr::join(Pitching_Stats_with_names, Pitches_All, by= "pitcher_name", type="inner")

This is a very large table, with which a lot of analysis can be done.  If you choose to go a different direction from here please let me know what insights you find!

We, however, are going to start trimming the fat to look at only the columns we need for this analysis.

Pitches_All_and_Stats <- subset(Pitches_All_and_Stats , stint == 1)
Pitches_All_and_Stats <- subset(Pitches_All_and_Stats, IPouts >= 150)
Pitches_Fastballs_and_Stats <- subset(Pitches_All_and_Stats, pitch_type %in% c("FA", "FF", "FT"))
Pitches_Fastballs_and_Stats <- subset(Pitches_Fastballs_and_Stats, select = c("pitcher_name","start_speed","ERA"))

The “stint == 1” is to ensure there is only one row for each pitcher.  Pitchers who pitched for multiple teams in one year will have more than one row for each team.  “IPouts >= 150” is to filter the results to only include pitchers who have thrown at least 50 innings (50 innings = 150 Outs) to avoid pitchers with a small sample size.

Now we have every fastball thrown for pitchers who threw at least 50 innings in 2016 along with the name of the pitcher and that pitcher’s 2016 ERA.  Now the only thing left to do is to average the fastball speed so we have one row for each pitcher and then rename the column back to “pitcher_name.”

Pitchers_Fastballs_ERA <- aggregate(Pitches_Fastballs_and_Stats[,2:3], list(Pitches_Fastballs_and_Stats$pitcher_name), mean)
colnames(Pitchers_Fastballs_ERA)[colnames(Pitchers_Fastballs_ERA)=="Group.1"] <- "pitcher_name"



First 17 rows of “Pitchers_Fastballs_ERA” table.

Creating the Linear Model

We’ll start by looking at the correlation between start_speed and ERA.


The correlation we get is -0.234668 which is not a particularly strong correlation but it shows that there is some sort of relationship between speed and ERA.  As speed increases, ERA tends to decrease.

Now, let’s create a linear model.

lm_ERA <- lm(formula = ERA ~ start_speed, data = Pitchers_Fastballs_ERA)


    Min      1Q  Median      3Q     Max 
-2.2760 -0.8130 -0.0811  0.6954  5.3207 
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 14.90057    2.78855   5.343 1.99e-07 ***
start_speed -0.11759    0.03021  -3.893 0.000126 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.147 on 260 degrees of freedom
Multiple R-squared:  0.05507, Adjusted R-squared:  0.05143 
F-statistic: 15.15 on 1 and 260 DF,  p-value: 0.0001261

Our line of best fit has the equation:

ERA = 14.90057 – 0.11759*(start_speed)

Using this equation we can predict ERA using start_speed but how accurate would this prediction be?

Also, worth noting are the p-value for start_speed and the adjusted R-squared value.  The p-value for start speed (represented by Pr(>|t|) is 0.000126.  What this means is that there is a 0.0126% chance that our results just happen by coincidence.  This tells me that fastball speed does have a statistically significant impact on ERA.

The adjusted R-squared value is 0.05143.  What this means is 5.143% of the variance in ERA is explained by our model.  Not a large amount unfortunately.  This tells me that, while fastball speed does influence performance, it is only just one part of it.

Using the ggplot2 package we can create a scatterplot of our data with the line of best fit.

       aes(x = `start_speed`, y = `ERA`)) + geom_point() +
  theme(panel.border = element_rect(color = "black", fill = NA, size = 1),
        panel.background = element_rect(fill = "white"),
        panel.grid.major = element_line(color = "grey", linetype = "dashed")) +
  stat_smooth(method = "lm", color = "orange", size = 1, level = 0.95)

The shaded gray area represents the confidence interval.  You’ll see in the code that we set the confidence level for the “stat_smooth” argument to 95%.

Testing Assumptions and Improving the Model

The first thing you should always check when making a linear regression model is that the residuals are normally distributed (bell curve).  Residuals are the distance of each point to the line of best fit. If the residuals fit a normal distribution then this tells us that a linear model makes sense and not a polynomial one.

lm_ERA_resid <-$residuals)
plot(density(lm_ERA$residuals),xlab = "Residuals", main = "Distribution of Residuals")



This does look normally distributed with a little skew to the right. To confirm that it’s normally distributed, let’s create a Q-Q Plot. A Q-Q plot compares our data to that of a theoretical normal distribution.  If the graph forms a straight line then our residuals are normally distributed.

qqnorm(lm_ERA$residuals, main = "Q-Q Plot to Test Normality of the Residuals")
qqline(lm_ERA$residuals, lwd = 3, col = "red")


Our data fits the line well, but it appears there may be some outliers.  We can use a box plot to check for outliers and then remove them. 

boxplot <- boxplot(Pitchers_Fastballs_ERA$start_speed, Pitchers_Fastballs_ERA$ERA, main = "Fastballs and ERA Boxplot")
ERA_Boxplot <- boxplot(Pitchers_Fastballs_ERA$ERA, main = "ERA Boxplot")
Speed_Boxplot <- boxplot(Pitchers_Fastballs_ERA$start_speed, main = "Speed Boxplot")

You can read more about how a boxplot is calculated here

boxplot_outliers <- data.frame(boxplot$out, boxplot$group)

1    83.10969             1
2    83.23261             1
3     7.59000             2
4     9.36000             2
5     8.02000             2

Now let’s mark which points in our data are outliers and graph it again

boxplot_outliers_ERA <- Pitchers_Fastballs_ERA[Pitchers_Fastballs_ERA$ERA %in% boxplot_outliers[3:5, 1], ]
boxplot_outliers_Speed <- Pitchers_Fastballs_ERA[Pitchers_Fastballs_ERA$start_speed %in% boxplot_outliers[1:2, 1], ]
boxplot_outliers_ERA_names <- rownames(boxplot_outliers_ERA)
boxplot_outliers_Speed_names <- rownames(boxplot_outliers_Speed
boxplot_outliers_names <- c(boxplot_outliers_ERA_names, boxplot_outliers_Speed_names)
ggplot(Pitchers_Fastballs_ERA, aes(x = `start_speed`, y = `ERA`)) + geom_point() +  theme(panel.border = element_rect(color = "black",  fill = NA, size = 1),
      panel.background = element_rect(fill = "white"), panel.grid.major = element_line(color = "grey", linetype "dashed")) +
              stat_smooth(method = "lm", color = "orange", size = 1, level = 0.95) +
               geom_point(data = Pitchers_Fastballs_ERA[boxplot_outliers_names,],
               aes(x = Pitchers_Fastballs_ERA[boxplot_outliers_names, ]$start_speed,
                 y = Pitchers_Fastballs_ERA[boxplot_outliers_names, ]$ERA), color = "red",
size = 3)

The red points represent the outliers.  Now let’s take these points out and recreate the model.

Pitchers_Fastballs_ERA_No_Outliers <- subset(Pitchers_Fastballs_ERA, !rownames(Pitchers_Fastballs_ERA) %in% boxplot_outliers_names)

ggplot(Pitchers_Fastballs_ERA_No_Outliers, aes(x = `start_speed`, y = `ERA`)) +
  geom_point() + theme(panel.border = element_rect(color = "black", fill = NA, size = 1),                       panel.background = element_rect(fill = "white"), panel.grid.major = element_line(color = "grey", linetype = "dashed")) +  stat_smooth(method = "lm", color = "orange", size = 1, level = 0.95)




Correlation has improved from -0.235 to -0.265

lm_ERA_No_Outliers <- lm(formula = ERA ~ start_speed, data = Pitchers_Fastballs_ERA_No_Outliers)


    Min      1Q  Median      3Q     Max 
-2.2256 -0.7723 -0.0313  0.6986  3.2559 
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 16.00280    2.72378   5.875 1.31e-08 ***
start_speed -0.12998    0.02948  -4.409 1.53e-05 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.052 on 255 degrees of freedom
Multiple R-squared:  0.07082, Adjusted R-squared:  0.06718 
F-statistic: 19.44 on 1 and 255 DF,  p-value: 1.535e-05

Our new formula is

ERA = 16.00280 - 0.12998*(start_Speed)

What we see here is that, with such a low p-value, fastball speed definitely influences ERA.  However, even though it has improved, the adjusted R-squared is still only 0.06718 which means that our model still only explains less than 7% of the variation in ERA.

Multivariate Linear Model

To help improve the model, we can add more factors.  Let’s use my new favorite visualization to look at correlation between multiple objects: ERA, fastball speed, spin rate on fastballs, spin rate on curveballs and sliders, and salary.

The ggpairs command in the “GGally” package displays a correlation matrix with a scatterplot comparing all the columns to each other, a histogram showing the distribution of each column, and each correlation value.

        lower = list(continuous = "smooth"), diag = list(continuous = "barDiag"))



Looking at this, it appears spin rate does influence ERA, although the correlation is not as strong as it is for fastball speed.  Salary does not have an impact and actually decreases the adjusted R- squared value of the model.  Salary is often a representation of how long a player has been in the league, less so how well they perform.

multivariate_lm_ERA <- lm(formula = ERA ~ start_speed + spin_rate + spin_rate_cu, data = Pitchers_No_Outliers)


     Min       1Q   Median       3Q      Max 
-2.23046 -0.77149 -0.07979  0.70964  3.08181 
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  17.8763102  2.7985197   6.388 7.93e-10 ***
start_speed  -0.1431117  0.0304353  -4.702 4.22e-06 ***
spin_rate    -0.0001266  0.0002769  -0.457   0.6480    
spin_rate_cu -0.0003309  0.0001638  -2.021   0.0444 *  
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.07 on 255 degrees of freedom
  (6 observations deleted due to missingness)
Multiple R-squared:  0.09316, Adjusted R-squared:  0.08249 
F-statistic: 8.732 on 3 and 255 DF,  p-value: 1.556e-05



The new equation to predict ERA is:

ERA = 17.876 – 0.143*(start_speed) – 0.000126*(spin_rate) – 0.0003309*(spin_rate_cu)


The very small p-value associated with start_speed shows that fastball speed does have a statistically significant impact on performance.  However, the adjusted R-squared value is still only 0.0825 so there is still a lot that this model is unable to explain.  So, while fastball speed is an important factor, it is certainly not the only one that is necessary to be a good pitcher.

One could continue to improve the model from here.  Some factors that we did not include but intuitively would influence performance are: location of pitches, variance of speed and/or spin between pitches, pitch selection, and many others that may have yet to be discovered.