7 Bringing it all together
7.1 Exercise 9
Let's look at one of the three newer variables from mlb11, slugging proportion, called new_slug. Slugging proportion is essentially a measure of the productivity of a batter. This is one of the statistics used by the author of Moneyball.
Produce a scatterplot of this variable against runs. Does there seem to be a linear relationship?
ggplot(data = ???, mapping = ???) +
  geom_point() +
  labs(???)

ggplot(data = mlb11, mapping = aes(x = new_slug, y = runs)) +
  geom_point() +
  labs(title = "Relationship between Runs and Slugging Prop.", x = "Slugging Prop.", y = "Runs")

Does the relationship between runs and slugging proportion look linear? If you knew a team’s slugging proportion, would you be comfortable using a linear model to predict the number of runs?
7.2 Exercise 10
Fit a linear model m3 to this pair of variables new_slug and runs.
m3 <- lm(???)
summary(???)

m3 <- lm(runs ~ new_slug, data = mlb11)
summary(m3)

How does the model between runs and slugging proportion compare to the at-bats and homeruns models?
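One way to make this comparison concrete is to put the R-squared values of the three models side by side. The following is a minimal sketch, not part of the lab: `compare_r_squared` is a hypothetical helper, and the column names `at_bats` and `homeruns` are assumed to be the ones used for the earlier models.

```r
# Hypothetical helper (not part of the lab): fits a simple linear
# regression of `response` on each predictor in turn and returns the
# R-squared of each fit as a named numeric vector.
compare_r_squared <- function(data, response, predictors) {
  sapply(predictors, function(p) {
    fit <- lm(reformulate(p, response = response), data = data)
    summary(fit)$r.squared
  })
}

# Runs only if mlb11 is already loaded, as it is in this lab.
# The column names at_bats and homeruns are assumptions.
if (exists("mlb11")) {
  print(compare_r_squared(mlb11, "runs", c("at_bats", "homeruns", "new_slug")))
}
```

A larger R-squared means the predictor explains more of the variability in runs, but it says nothing about whether the model assumptions hold, so it complements the residual checks rather than replacing them.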
7.3 Exercise 11
The equation of the regression line is
A: \(\textrm{Slugging Prop} = -376 + 2681 \times \textrm{Runs}\)
B: \(\textrm{Runs} = 2681 - 376 \times \textrm{Slugging Prop}\)
C: \(\textrm{Runs} = -376 + 2681 \times \textrm{Slugging Prop}\)
D: \(\textrm{Slugging Prop} = 2681 - 376 \times \textrm{Runs}\)
What does the slope tell us in the context of the relationship between the success of a team and its slugging proportion?
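One way to read a slope on this scale, sketched with generic symbols rather than the fitted values: if the fitted line is written as \(\widehat{\textrm{Runs}} = b_0 + b_1 \times \textrm{Slugging Prop}\), then a change of \(\Delta\) in slugging proportion shifts the predicted number of runs by

\[ \Delta\widehat{\textrm{Runs}} = b_1 \times \Delta \]

Differences in slugging proportion between teams are typically much smaller than 1, so a one-unit increase is not a realistic comparison. A small step such as \(\Delta = 0.01\) (an illustrative choice, not a value from the lab) gives a predicted change of \(0.01 \times b_1\) runs, which is usually easier to interpret.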
7.4 Exercise 12
Check the model assumptions of the new model. Do they all hold?
ggplot(???) +
  geom_???() +
  geom_hline(???) +
  labs(???)

ggplot(???) +
  geom_???(???) +
  xlab(???)

ggplot(???) +
  stat_qq() +
  stat_qq_line()

ggplot(???) +
  geom_point() +
  geom_hline(???) +
  labs(x = "Fitted Values", y = "Residuals")

ggplot(???) +
  geom_histogram(???) +
  xlab("Residuals")

ggplot(???) +
  stat_qq() +
  stat_qq_line()

ggplot(data = m3, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted Values", y = "Residuals")

ggplot(data = m3, aes(x = .resid)) +
  geom_histogram(color = "white", binwidth = 15) +
  xlab("Residuals")

ggplot(data = m3, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line()

Do all the model assumptions hold?