2 Introducing the Data

The movie Moneyball follows a low-budget American baseball team, the Oakland Athletics, who believed that underused statistics, such as a player’s ability to get on base, predict a team’s ability to score runs better than the statistics typically used, such as home runs and batting average. Players who excelled in these underused statistics turned out to be much more affordable for the team.

In this tutorial we’ll be looking at data from all US Major League Baseball teams and examining the linear relationship between runs scored in a season and a number of other player statistics.

Our aim will be to summarise these relationships both graphically and numerically in order to find which variable, if any, helps us best predict a team’s runs scored in a season.

The dataset covers the 2011 season and contains data on every team in Major League Baseball, the highest-paid baseball league in the United States.

For this tutorial we’ll consider 9 of the variables in the dataset:

• team : Name of the team.

• runs : Number of runs.

• at_bats : Number of at bats (a batter's turn hitting against a pitcher, i.e. the member of the opposing team who throws the ball to them).

• hits : Number of hits.

• homeruns : Number of home runs.

• bat_avg : Batting average (hits divided by at bats).

• strikeouts : Number of strikeouts.

• stolen_bases : Number of stolen bases.

• wins : Number of wins.

Loading the Data

The data to be analyzed are saved in a .csv file called mlb11.csv which can be imported using the following code.

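library(tidyverse)  # provides ggplot2 and dplyr, which the plotting and summarise() code below relies on
mlb11 <- read.csv(url("https://raw.githubusercontent.com/Glasgow-Stats-L1-L2/S1Z_Lab3/main/mlb11.csv"))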

First of all, it is always a good idea to look at the dataset you are working with to get a good sense of what it looks like, and the different types of data you may have, e.g. categorical or numerical. Recall, the dataset is called mlb11.

Run the code below to look at the first 6 rows of the data and at its structure.

head(mlb11)

str(mlb11)

How many baseball teams are there in America's Major League?
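
One way to answer a question like this without counting rows by eye is to ask R for the dimensions of the data frame. The sketch below uses R's built-in mtcars data as a stand-in (an assumption made so that it runs without a download); the same calls should work on mlb11, where each row corresponds to one team.

```r
# Stand-in data: mtcars ships with R. Replace `dat` with mlb11 in the tutorial.
dat <- mtcars

n_rows <- nrow(dat)               # number of observations (rows)
n_cols <- ncol(dat)               # number of variables (columns)
col_types <- sapply(dat, class)   # storage type of each variable, e.g. "numeric"

n_rows
n_cols
col_types
```

The names dat, n_rows, n_cols and col_types are illustrative only and are not part of the tutorial's own code.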

2.1 Exercise 1

Create a plot to display the relationship between runs and at_bats, with runs as your outcome/response variable and at_bats as your explanatory/predictor variable. As an added bonus, you can add + labs(title = "", x = "", y = "") to your graph code to give the plot a title and axis labels. Put the appropriate text between the quotation marks.

ggplot(data = ???, mapping = ???) +
    geom_point() +
    labs(???)

ggplot(data = mlb11, mapping = aes(x = at_bats, y = runs)) +
    geom_point() +
    labs(title = "Relationship between Runs and At-Bats", x = "At-Bats", y = "Runs")
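
A common aid for judging whether a scatterplot looks linear is to overlay a least-squares straight line and see whether the points scatter evenly around it. The sketch below is not part of the exercise solution. It uses the built-in mtcars data (an assumption, chosen so that the code runs without a download), and geom_smooth(method = "lm") from ggplot2 to draw the line. For the tutorial data you would swap in mlb11, at_bats and runs.

```r
library(ggplot2)

# Stand-in example: car weight (wt) against fuel economy (mpg) from mtcars
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
    geom_point() +
    # Overlay a straight line fitted by least squares; se = FALSE hides the confidence band
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    labs(title = "Scatterplot with fitted straight line", x = "Weight", y = "Miles per gallon")

print(p)
```

If the points curve away from the line at one or both ends, or fan out as x increases, a straight-line model may be a poor description of the relationship.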

Does the relationship between runs and at-bats look linear? If you knew a team’s at-bats, would you be comfortable using a linear model to predict the number of runs?

• Yes, the relationship does appear approximately linear. However, we would not be comfortable using a linear model to predict the number of runs from the at-bats. We would have to consider a different variable to choose.

• No, the relationship does not appear linear, and hence we would not be comfortable using a linear model to predict the number of runs from the at-bats. We would have to consider a different variable to choose.

• No, the relationship does not appear linear. However, despite its non-linearity we could still use a regression model to predict the number of runs from the at-bats.

• Yes, the relationship does appear approximately linear. We would therefore be comfortable using a linear model to predict the number of runs from the at-bats.

If the relationship looks linear, we can quantify the strength of the relationship with the correlation coefficient. There is quite a lot of room for interpretation when looking at correlation coefficients, but a rough guide to interpreting them is given below.

Correlation Coefficient	Verbal Interpretation
0.9 to 1.0 or -0.9 to -1.0	Very strong positive or negative correlation
0.7 to 0.9 or -0.7 to -0.9	Strong positive or negative correlation
0.5 to 0.7 or -0.5 to -0.7	Moderate positive or negative correlation
0.3 to 0.5 or -0.3 to -0.5	Weak positive or negative correlation
0.0 to 0.3 or 0.0 to -0.3	Very weak positive or negative correlation
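
The rough guide above can be written as a small lookup function. This is only a sketch: describe_correlation is a hypothetical helper, not something provided by R or by this tutorial, and it assumes that a value sitting exactly on a boundary (for example 0.7) belongs to the stronger category, which the table itself leaves open.

```r
# Hypothetical helper that maps correlation coefficients to the verbal labels in the table.
describe_correlation <- function(r) {
    stopifnot(is.numeric(r), all(abs(r) <= 1))
    strength <- cut(
        abs(r),
        breaks = c(0, 0.3, 0.5, 0.7, 0.9, 1),
        labels = c("very weak", "weak", "moderate", "strong", "very strong"),
        right = FALSE,          # intervals are [lower, upper): boundary values go to the stronger label
        include.lowest = TRUE   # makes the top interval [0.9, 1] so that |r| = 1 is included
    )
    # A coefficient of exactly 0 is labelled "positive" here, following the table's last row
    direction <- ifelse(r < 0, "negative", "positive")
    paste(strength, direction, "correlation")
}

# Made-up coefficients for illustration, not values computed from mlb11
describe_correlation(c(0.62, -0.95, 0.3))
```

The function is vectorised, so it accepts several coefficients at once and returns one label per coefficient.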

Run the following code to find the correlation coefficient between runs and at_bats.

mlb11 %>%
    summarise(cor(runs, at_bats))
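
To make concrete what cor() computes by default, the sketch below works out the Pearson correlation coefficient by hand and compares it with the built-in function. The vectors x and y are made-up toy numbers, not values from mlb11, and pearson_r is an illustrative name rather than an existing R function.

```r
# Made-up toy vectors, NOT values from mlb11
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Pearson correlation: sum of products of deviations from the means,
# divided by the square root of the product of the sums of squared deviations
pearson_r <- function(x, y) {
    dx <- x - mean(x)
    dy <- y - mean(y)
    sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_manual <- pearson_r(x, y)
r_builtin <- cor(x, y)

# TRUE when the hand calculation matches cor() up to floating-point tolerance
all.equal(r_manual, r_builtin)
```

On the tutorial data, pearson_r(mlb11$runs, mlb11$at_bats) should agree with the result of the pipeline above.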

Looking at your plot and the correlation coefficient you just calculated, how would you describe the relationship between the two variables?

• There appears to be a strong positive linear relationship between the runs of a Major League baseball team and their at-bats. There aren't any particularly unusual observations, but the point at roughly (5525, 870) could be considered slightly unusual.

• There appears to be a weak positive linear relationship between the runs of a Major League baseball team and their at-bats. There aren't any unusual observations.

• There appears to be a moderate positive linear relationship between the runs of a Major League baseball team and their at-bats. There aren't any unusual observations.

• There appears to be a moderate positive linear relationship between the runs of a Major League baseball team and their at-bats. There aren't any particularly unusual observations, but the point at roughly (5525, 870) could be considered slightly unusual.