2 Introducing the Data

The movie Moneyball follows a low-budget American baseball team, the Oakland Athletics, who believed that underused statistics, such as a player’s ability to get on base, better predict the ability to score runs, compared to statistics typically used like home runs and batting average. Obtaining players who excelled in these underused statistics turned out to be much more affordable for the team.
In this tutorial we’ll be looking at data from all US Major League Baseball teams and examining the linear relationship between runs scored in a season and a number of other player statistics.
Our aim will be to summarise these relationships both graphically and numerically in order to find which variable, if any, helps us best predict a team’s runs scored in a season.
The data set looks at the 2011 season in particular, and has data on all of the Major League Baseball teams, the highest paid baseball league in the United States.
For this tutorial we’ll consider 9 of the variables in the dataset.
The variables we'll consider are:
• team : Name of the team.
• runs : Number of runs.
• at_bats : Number of at bats (a batters turn hitting against a pitcher, i.e. the member of the opposite team throwing the ball at them).
• hits : Number of hits.
• homeruns : Number of home runs.
• bat_avg : Batting average.
• strikeouts : Number of strikeouts.
• stolen_bases : Number of stolen bases.
• wins : Number of wins.
Loading the Data
The data to be analyzed are saved in a .csv file called mlb11.csv which can be imported using the following code.
mlb11 <- read.csv(url("https://raw.githubusercontent.com/Glasgow-Stats-L1-L2/S1Z_Lab3/main/mlb11.csv"))First of all, it is always a good idea to look at the dataset you are working with to get a good sense of what it looks like, and the different types of data you may have, e.g. categorical or numerical. Recall, the dataset is called mlb11.
Run the code below to look at the first 6 rows of data and the structure of it.
head(mlb11)str(mlb11)How many baseball teams are there in America's Major League?
2.1 Exercise 1
Create a plot to display the relationship between runs and at_bats, i.e. with runs as your outcome/response variable and at_bats as your explanatory/predictor variable. As an added bonus, you can add + labs(title = "", x = "", y = "") onto your graph code to add a title and axes labels. Put the appropriate text between the quotation marks.
ggplot(data = ???, mapping = ???) +
geom_point() +
labs(???)ggplot(data = mlb11, mapping = aes(x = at_bats, y = runs)) +
geom_point() +
labs(title = "Relationship between Runs and At-Bats", x = "At-Bats", y = "Runs")Does the relationship between runs and at-bats look linear? If you knew a team’s at-bats, would you be comfortable using a linear model to predict the number of runs?
If the relationship looks linear, we can quantify the strength of the relationship with the correlation coefficient. There is quite a lot of room for interpretation when looking at correlation coefficients, but a rough guide to interpreting them is given below.
| Correlation Coefficient | Verbal Interpretation |
|---|---|
| 0.9 to 1.0 or -0.9 to -1.0 | Very strong positive or negative correlation |
| 0.7 to 0.9 or -0.7 to -0.9 | Strong positive or negative correlation |
| 0.5 to 0.7 or -0.5 to -0.7 | Moderate positive or negative correlation |
| 0.3 to 0.5 or -0.3 to -0.5 | Weak positive or negative correlation |
| 0.0 to 0.3 or 0.0 to -0.3 | Very weak positive or negative correlation |
Run the following code to find the correlation coefficient between runs and at_bats.
mlb11 %>%
summarise(cor(runs, at_bats))Looking at your plot and the correlation coefficient you just calculated, how would you describe the relationship between the two variables?