5 Prediction and Prediction Errors


Let's return to the relationship between runs and at_bats. We'll redraw our earlier scatterplot, this time with the fitted regression line from the model superimposed on top. Recall that the equation for the regression line is:

\[\widehat{\textrm{Runs}} = -2789 + 0.6305 \times \textrm{At-Bats}\]

ggplot(data = mlb11, aes(x = at_bats, y = runs)) +
    geom_point() +
    labs(title = "Relationship between Runs and At-Bats", x = "At-Bats", y = "Runs") +
    geom_smooth(method = "lm", se = FALSE)

Here, we are literally adding a layer on top of our plot. geom_smooth with method = "lm" adds the regression line, which is fitted by minimizing the sum of squared residuals. It can also draw a band showing the standard error associated with our line, but we suppress that for now using the se = FALSE argument. (You can set se = TRUE if you're interested in looking at it. It essentially draws a confidence band around the regression line.)
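To make "minimizing the sum of squared residuals" concrete, here is a minimal sketch in base R. The toy data and the helper `ssr` are made up for illustration and are not part of mlb11. It fits a line with lm and checks that nudging the intercept or the slope away from the fitted values makes the sum of squared residuals larger.

```r
# Hypothetical toy data, not taken from mlb11
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least-squares fit; coef() returns the intercept and the slope
fit <- lm(y ~ x)
b <- unname(coef(fit))

# Sum of squared residuals for any candidate line y = intercept + slope * x
ssr <- function(intercept, slope) {
  sum((y - (intercept + slope * x))^2)
}

ssr_fit     <- ssr(b[1], b[2])        # the least-squares line
ssr_shifted <- ssr(b[1] + 0.5, b[2])  # same slope, intercept moved up
ssr_tilted  <- ssr(b[1], b[2] + 0.1)  # same intercept, steeper slope

# The fitted line should beat both alternatives
ssr_fit < ssr_shifted
ssr_fit < ssr_tilted
```

Any other perturbation of the two coefficients would give the same result, since the least-squares line has the smallest possible sum of squared residuals for these data.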

The regression line can be used to predict \(y\) at any value of \(x\) within the range of observed \(x\) values.
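As a rough sketch of what such a prediction looks like, the code below plugs a value of at-bats into the rounded equation quoted above. The helper `predict_runs` and the value 5500 are hypothetical, chosen only for illustration, and 5500 is assumed to lie within the observed range of at_bats.

```r
# Rounded coefficients as quoted in the regression equation above
intercept <- -2789
slope <- 0.6305

# Hypothetical helper: predicted runs for a given number of at-bats
predict_runs <- function(at_bats) {
  intercept + slope * at_bats
}

# Hypothetical team with 5500 at-bats
predicted <- predict_runs(5500)
predicted  # -2789 + 0.6305 * 5500 = 678.75
```

If the fitted model object is still in your workspace (say it is called `m1`, which is an assumed name), `predict(m1, newdata = data.frame(at_bats = 5500))` should give a similar value. It will differ slightly because the model stores the coefficients at full precision rather than rounded.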


Interpolation is when...


Extrapolation is when...

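Extrapolation is not usually recommended, because we have no data to tell us whether the linear relationship continues to hold outside the observed range. Predictions made within the range of the data are more reliable. The predictions at the observed \(x\) values (the fitted values) are also what we use to compute the residuals.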
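Since a residual is the observed value minus the fitted value, it can be computed by hand once you have a prediction. The sketch below uses a hypothetical team (5500 at-bats, 700 runs) that is not in mlb11, together with the rounded coefficients from the equation above.

```r
# Rounded coefficients as quoted in the regression equation above
intercept <- -2789
slope <- 0.6305

# Hypothetical team, made up for illustration
at_bats_obs <- 5500
runs_obs <- 700

# Fitted value from the regression line
runs_fitted <- intercept + slope * at_bats_obs

# Residual = observed - fitted
resid_value <- runs_obs - runs_fitted
resid_value  # 700 - 678.75 = 21.25
```

A positive residual means the team scored more runs than the line predicts, so the point lies above the line. A negative residual means the point lies below it.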


Use the fitted model to answer the questions below.

5.1 Exercise 4

To the nearest whole number, the New York Yankees had 867 runs and 5518 at-bats in the 2011 season. The residual for the New York Yankees is


Based on your previous answer, is the fitted value for the New York Yankees an overestimate or underestimate of their runs?


5.2 Exercise 5

To the nearest whole number, the Chicago Cubs had 654 runs and 5549 at-bats in the 2011 season. The residual for the Chicago Cubs is (to 2 decimal places)


Based on your previous answer, is the fitted value for the Chicago Cubs an overestimate or underestimate of their runs?