2 2021 Paper 1 Example

2.1 Question 1

In this example, we will take a look at question 1 from Paper 1 in 2021. This question is a report style question based on Google AI data of times taken to draw a cat or a dog. The example contains a stem and leaf diagram with the combined data and some summary statistics for both sets of drawings. Following this, a Mann-Whitney Test is carried out to test whether both samples have different average drawing times.

We will now look at how we can simulate a sample of the data from the question, by randomly sampling data for both groups using properties from their summary statistics

2.1.1 Generating a random sample of data

We can generate a random sample of data for both the categories using the summary statistics provided. As we can see from the stem and leaf diagram, the data have a lower bound, where we cannot observe any data below zero, as the data recorded are based on time elapsed.

In order to sample data of this form, we can use a variation of the Normal distribution called the truncated normal distribution, which allows us to bound a Normal distribution through a given range.

To randomly sample data from this distribution, we can use the rtruncnorm function from the truncnorm package in R as follows:

sample_cat <- rtruncnorm(n=121, a=0, b=14, mean=5.4, sd=2.31)
sample_dog <- rtruncnorm(n=145, a=0, b=16, mean=7.5, sd=2.66)

The parameters required are defined as follows:

n - the number of samples we wish to draw
a - the lower bound of the distribution (here, we will set this to 0)
b - the upper bound of the distribution (here, we have set this to be the ceiling of the max value for each group)
mean - the mean of each group
sd - the standard deviation of each group

Running the code above will produce a sample for both groups based on their relative summary statistics. We can check the summary statistics of our data using summary()

summary(sample_cat)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3641  3.4919  4.7518  5.0838  6.3575 11.2882

summary(sample_dog)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2323  6.4623  8.1252  7.9339  9.3019 15.5964

We can now compare these summary statistics to those of the real data provided in the paper and notice that our simulated data produces fairly similar summary statistics.

Exercise Try editing some of the parameters in rtruncnorm to see how this changes the summary statistics you obtain.

2.1.2 Visualising the data

We can also check the distribution of our sampled data for comparison by visualising it using a histogram. The ggplot2 library found in the tidyverse library provides several functions for data visualisation and has become more popular than base R graphics. We will use the geom_histogram() function from the library in this example as follows:

dog_hist <- ggplot(data.frame(sample_dog),aes(x=sample_dog))+
  geom_histogram(color="black",fill="white") + 
            labs(title="Average draw time of dogs",x="Time (s)")
cat_hist <- ggplot(data.frame(sample_cat),aes(x=sample_cat))+
  geom_histogram(color="black",fill="white") + 
  labs(title="Average draw time of cats",x="Time (s)")

grid.arrange(cat_hist, dog_hist,ncol=2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

To create these histograms, the code above works in the following fashion:

We first specify our data using ggplot(data), where our data here is either group of samples.
To specify which variables we would like to select, we use the aes() argument. As we only have one sample of data in each case, we specify this using x=data.
We then generate the histogram using geom_histogram(), where we can define the line colour using color and the filled colour of the bars using fill.
We can label our plot using the labs argument, where we can include a title and an x axis label

We can also alter the number of bins we use (ggplot will set a standard number of bins by default) using the bins argument. Let’s alter the number of bins for the dog data to be 10

dog_hist <- ggplot(data.frame(sample_dog),aes(x=sample_dog))+
  geom_histogram(color="black",fill="white",bins=10) + 
  labs(title="Average draw time of dogs",x="Time (s)")

Exercise Try edit some of the inputs to the histogram plot. You can change the colour scheme, number of bins and the labels for the plot.

2.1.3 Setting random seeds for reproducibility

When we randomly sample data each time in R, we will obtain a different sample than before. Let’s run our previous code twice to see if there is any differences:

sample_dog1 <- rtruncnorm(n=145, a=0, b=16, mean=7.5, sd=2.66)
sample_dog2 <- rtruncnorm(n=145, a=0, b=16, mean=7.5, sd=2.66)

summary(sample_dog1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4464  5.7736  7.3936  7.5150  9.2837 13.4899

summary(sample_dog2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2391  5.3456  7.5616  7.4537  9.2504 13.4568

We see that both samples produce different summary statistics. This can cause difficulty when you are working on a specific problem and want to design questions around the specific characteristics of the data you have sampled the first time.

We can force R to use the same random number generation by using the set.seed() function. Here, we specify the seed from the random number generator we want to use each time we generate samples. This number can be any number you wish to choose! The example below highlights how this works:

set.seed(2023)
sample_dog1 <- rtruncnorm(n=145, a=0, b=16, mean=7.5, sd=2.66)
set.seed(2023)
sample_dog2 <- rtruncnorm(n=145, a=0, b=16, mean=7.5, sd=2.66)

summary(sample_dog1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.006   6.316   7.545   7.798   9.365  14.776

summary(sample_dog2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.006   6.316   7.545   7.798   9.365  14.776

Here, we see we can produce the same data as the first sample by setting the seed prior to sampling.

Exercise Try setting your own choice of random seed here and see how the summary statistics change.

2.1.4 Manually editing data

If we take a sample of data and perhaps wish to add some additional variables to mimic the original data closer, this can easily be done in R. Let’s take a sample for the dog data but lower our boundary to 14 and visualise.

set.seed(2023)
sample_dog <- rtruncnorm(n=145, a=0, b=14, mean=7.5, sd=2.66)

ggplot(data.frame(sample_dog),aes(x=sample_dog)) +
  geom_histogram(color="black",fill="white",bins=12) + 
            labs(title="Average draw time of dogs",x="Time(s)")

When comparing this to the original data from the paper, we see that we do not observe the two outliers at 14.3 seconds and 15.2 seconds. In order to maintain the same sample size, we can reduce our samples here to be 143, and then manually add the two outliers. We can manually add these values as follows:

set.seed(2023)
sample_dog <- rtruncnorm(n=143, a=0, b=14, mean=7.5, sd=2.66)
sample_dog <- c(sample_dog, c(14.3,15.2))

ggplot(data.frame(sample_dog),aes(x=sample_dog)) +
  geom_histogram(color="black",fill="white",bins=12) + 
            labs(title="Average draw time of dogs",x="Time (s)")

We now observe the two additional values in the histogram. For a vector of data, we can easily add new values using the c() command. This can be done by specifying our original data first, and then including our additional variables in a new c() object. If we only wish to add one value, we do not need to use c().

We can also make this update in a slightly different way. If we take our original sample of 145, we can directly replace two of the observations we have. We can replace the two highest values from our sampled data with the two outliers as follows:

set.seed(2023)
sample_dog <- rtruncnorm(n=145, a=0, b=14, mean=7.5, sd=2.66)

# Order the data from higest to lowest
sample_dog <- sample_dog[order(sample_dog,decreasing=TRUE)]

# Replace the 2 highest values with our outliers
sample_dog[1:2] <- c(14.3,15.2)

ggplot(data.frame(sample_dog),aes(x=sample_dog)) + geom_histogram(color="black",fill="white",bins=12) + 
            labs(title="Average draw time of dogs",x="Time (s)")

Here, we have used the order function to sort our data from the highest numerical observation to the lowest. The decreasing=TRUE argument sets this order from highest to lowest. If you wanted to replace the two lowest arguments, you can simply drop the decreasing=TRUE argument here. We then update the first two observations with our two outliers.

Exercise Try replacing the two lowest values of your sampled data with the two identified outliers.

2.1.5 Sampling from different distributions

In this case, the data shown roughly takes the form of a Normal distribution. There are several cases where this may not be an appropriate distribution. R contains a full range of different distributions we can sample from. Some common choices are shown below:

rbinom() - Sample from a Binomial distribution
rchisq() - Sample from a Chi-squared distribution
rgamma() - Sample from a Gamma distribution
runif() - Sample from a Uniform distribution
rexp() - Sample from an Exponential distribution

This list is not exhaustive, and R will contain libraries that will allow sampling from almost any known distribution.

Exercise Carry out the steps we have worked through above, but this time using a different distribution of your choice. You can find information on what parameters need to be specified using the ? notation before the function name (e.g. ?rexp)

2.1.6 Performing a Mann-Whitney Test

We can carry out a Mann-Whitney test in R as follows

sample_dog <- rtruncnorm(n=145, a=0, b=16, mean=7.5, sd=2.66)
sample_cat <- rtruncnorm(n=121, a=0, b=14, mean=5.4, sd=2.31)

mann_whitney <- wilcox.test(sample_dog,sample_cat)
mann_whitney

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  sample_dog and sample_cat
## W = 13000, p-value = 1.33e-11
## alternative hypothesis: true location shift is not equal to 0

The output from this test gives us the rank sum W and a p-value which can be used for question creation

2.2 Performing a z-test

We can perform a two-sample z test in R using the z.test function from the BSDA package. We can carry out this test as follows:

z.test(sample_cat,sample_dog,sigma.x=2.307,sigma.y=2.655)

## 
##  Two-sample z-Test
## 
## data:  sample_cat and sample_dog
## z = -7.1488, p-value = 8.752e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.771824 -1.578983
## sample estimates:
## mean of x mean of y 
##  5.286729  7.462132

Here, we specify the sample standard deviation for cats and dogs respectively. The output from the test gives us the z-statistic, p-value and 95% confidence interval.

1 Overview

3 Linear model example