Introducing The Data

In this tutorial, we will explore and visualise the data using the tidyverse suite of packages and perform statistical inference.

We will be looking at data from the Youth Risk Behaviour Surveillance System (YRBSS) survey, which uses data from American high school students to help discover health patterns. The data frame is called yrbss and is part of the openintro package should you wish to look at it yourself.

It contains 13 variables:

age : Age of the student.

gender : Gender of the student.

grade : School grade of the student.

hispanic : If the student is hispanic, or not.

race : Ethnicity of the student

height : Height of the student, in metres.

weight : Weight of the student, in kilograms.

helmet_12m : How often the student wore a helmet while riding a bike in the last 12 months.

text_while_driving_30d : How many days out of the last 30 did the the student text while driving.

physically_active_7d : How many days out of the last 7 was the student physically active for at least an hour.

hours_tv_per_school_day : How many hours of TV does the student typically watch on a school night.

strength_training_7d : How many days out of the last 7 did the student lift weights.

school_night_hours_sleep : How many hours of sleep does the student typically get on a school night?

First of all, it is always a good idea to look at the dataset you are working with to get a good sense of what it looks like, and the different types of data you may have, e.g. categorical or numerical.

Load the yrbss data set into your workspace by typing the following code in an Rscript file in your local Rstudio.

yrbss<-read.csv(url("https://raw.githubusercontent.com/Glasgow-Stats-L1-L2/S1Z_Lab1/main/yrbss.csv"))

You may also download the dataset named yrbss.csv file from our Moodle course page. Save the file in your local computer and then use the read.csv() function to import the data in your R. You can load the data set into your workspace by running the code below in an Rscript file in your local Rstudio. Make sure that the yrbss.csv file is saved in the same folder as the Rscript file.

Copy and run the code below to look at the first 6 rows of data and the structure of it.

head() is a function where you put in it's brackets the name of the data frame, in this case yrbss, and it gives you the first 6 lines of that data to look at. You can press the arrows to go along and see all the variables.

str() is another function again where you put the name of the data frame in the brackets, and this time it gives you the number of observations and the number of variables you have, the types of those variables e.g. "int" for an integer, "num" for a number and "chr" for a string of characters, as well as the first few values of each variable.

head(yrbss)
str(yrbss)


How many observations/rows are there in the entire data set?

Look again at the result of the str() function.


It is also worth checking if there are missing values in this data set.

To find how many observations we are missing from the weight variable you can run the following R code:

sum(is.na(yrbss$weight))

The is.na() function creates a vector that corresponds to the variable provided, (weight in this instance, from our dataset yrbss) with TRUE for a missing value and FALSE for a non-missing value. The sum() function then adds up the elements in this created vector such that TRUE is equivalent to 1 and FALSE is equivalent to 0 and hence the sum of this vector is equivalent to the number of TRUE entries which calculates how many of these "NA" values there are.


How many observations are we missing weights from? You will find the previous example useful.

Use the function sum(is.na(YRBSS$...)).


We can remove observations with missing values and create a data set without these using the code below:

yrbss <- na.omit(yrbss)


How many observations/rows are there in the new entire data set (with missing values removed)?

Look again at the result of the str() function.