7 Bringing it all together


7.1 Exercise 9

Let's look at one of the three newer variables from mlb11, slugging proportion, called new_slug. Slugging proportion is essentially a measure of the productivity of a batter. This is one of the statistics used by the author of Moneyball.

Produce a scatterplot of this variable against runs. Does there seem to be a linear relationship?

# Template: replace each ??? with your own code
ggplot(data = ???, mapping = ???) +
    geom_point() +
    labs(???)

# Solution
ggplot(data = mlb11, mapping = aes(x = new_slug, y = runs)) +
    geom_point() +
    labs(title = "Relationship between Runs and Slugging Prop.",
         x = "Slugging Prop.", y = "Runs")


Does the relationship between runs and slugging proportion look linear? If you knew a team’s slugging proportion, would you be comfortable using a linear model to predict the number of runs?


7.2 Exercise 10

Fit a linear model m3 to this pair of variables new_slug and runs.

# Template: replace each ??? with your own code
m3 <- lm(???)

summary(???)

# Solution
m3 <- lm(runs ~ new_slug, data = mlb11)

summary(m3)


How does the model relating runs to slugging proportion compare to the at-bats and home runs models?
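One way to make this comparison concrete is to put the R-squared and residual standard error of each fit side by side. The sketch below is only an illustration: `compare_fits()` is a hypothetical helper, and `toy` is a simulated stand-in for `mlb11` (the values are generated, not real 2011 data) so that the block runs without the lab's data package. In the lab itself, you would pass your own fitted models instead.

```r
# Simulated stand-in for mlb11: these values are generated, NOT real MLB data.
set.seed(1)
n <- 30
toy <- data.frame(
  at_bats  = round(rnorm(n, mean = 5500, sd = 80)),
  homeruns = round(rnorm(n, mean = 150, sd = 35)),
  new_slug = rnorm(n, mean = 0.40, sd = 0.03)
)
# In this toy data, runs depend on new_slug only (an assumption of the simulation).
toy$runs <- -376 + 2681 * toy$new_slug + rnorm(n, sd = 25)

# Hypothetical helper: collect fit statistics from a named list of lm objects.
compare_fits <- function(models) {
  data.frame(
    model     = names(models),
    r_squared = vapply(models, function(m) summary(m)$r.squared,
                       numeric(1), USE.NAMES = FALSE),
    sigma     = vapply(models, function(m) summary(m)$sigma,
                       numeric(1), USE.NAMES = FALSE)
  )
}

fits <- list(
  at_bats  = lm(runs ~ at_bats,  data = toy),
  homeruns = lm(runs ~ homeruns, data = toy),
  new_slug = lm(runs ~ new_slug, data = toy)
)

comparison <- compare_fits(fits)
# Sort so the model explaining the most variability in runs comes first.
print(comparison[order(-comparison$r_squared), ])
```

With the real data, the same call would look like `compare_fits(list(at_bats = m1, homeruns = m2, new_slug = m3))`, where `m1` and `m2` are assumed names for the at-bats and home runs models fitted earlier in the lab.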


7.3 Exercise 11

The equation of the regression line is


A: \(\textrm{Slugging Prop} = -376 + 2681 \times \textrm{Runs}\)

B: \(\textrm{Runs} = 2681 - 376 \times \textrm{Slugging Prop}\)

C: \(\textrm{Runs} = -376 + 2681 \times \textrm{Slugging Prop}\)

D: \(\textrm{Slugging Prop} = 2681 - 376 \times \textrm{Runs}\)


What does the slope tell us in the context of the relationship between the success of a team and its slugging proportion?
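A short worked calculation can help when reading a slope on this scale. The sketch below assumes the fitted line has runs as the response (as in the `m3` fit from Exercise 10), with a rounded intercept of -376 and slope of 2681. Check both against your own `summary(m3)` output. `predict_runs()` is a hypothetical helper introduced only for this illustration.

```r
# Assumed, rounded coefficients; confirm them against summary(m3).
intercept <- -376
slope <- 2681

# Hypothetical helper: predicted runs for a given slugging proportion.
predict_runs <- function(slug) intercept + slope * slug

# Team slugging proportions differ by hundredths rather than whole units,
# so a change of 0.01 is a more natural scale for reading the slope.
runs_at_400 <- predict_runs(0.400)
runs_at_410 <- predict_runs(0.410)
change_per_001 <- runs_at_410 - runs_at_400

print(runs_at_400)
print(change_per_001)
```

Under those assumed coefficients, a team slugging 0.400 is predicted to score about 696 runs, and each additional 0.01 of slugging proportion corresponds to roughly 27 more predicted runs.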


7.4 Exercise 12

Check the model assumptions of the new model. Do they all hold?

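# Template: replace each ??? with your own code
ggplot(???) +
    geom_???() +
    geom_hline(???) +
    labs(???)

ggplot(???) +
    geom_???(???) +
    xlab(???)

ggplot(???) +
    stat_qq() +
    stat_qq_line()

# Hint: the geoms and axis labels are filled in
ggplot(???) +
    geom_point() +
    geom_hline(???) +
    labs(x = "Fitted Values", y = "Residuals")

ggplot(???) +
    geom_histogram(???) +
    xlab("Residuals")

ggplot(???) +
    stat_qq() +
    stat_qq_line()

# Solution
# Passing the lm object as data works because ggplot2 fortifies it,
# adding columns such as .fitted and .resid.

# Residuals vs. fitted values: checks linearity and constant variability
ggplot(data = m3, aes(x = .fitted, y = .resid)) +
    geom_point() +
    geom_hline(yintercept = 0, linetype = "dashed") +
    labs(x = "Fitted Values", y = "Residuals")

# Histogram of residuals: checks for nearly normal residuals
ggplot(data = m3, aes(x = .resid)) +
    geom_histogram(color = "white", binwidth = 15) +
    xlab("Residuals")

# Normal Q-Q plot of residuals: a second check of normality
ggplot(data = m3, aes(sample = .resid)) +
    stat_qq() +
    stat_qq_line()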


Do all the model assumptions hold?


7.5 Exercise 13

Assuming the assumptions of the new model hold, which of the three models considered in this lab do you think is best at predicting runs, and why?