Data types and classes

Lecture 8

2024-05-28

Announcements

Exam 1 in class next week on Monday – cheat sheet (1 page, both sides, hand-written or typed, must be prepared by you)
Exam 1 take-home starts after class on Monday and is due at 11:59 PM on Thursday (open resources, internet, etc., but closed to other humans) - neither I nor the TA will be available to answer questions!!
Next class: Exam 1 review – come with questions!

Study tips for the exam

Go over lecture materials and application exercises
Review labs and feedback you’ve received so far
In particular, please go over Lab 2 and make sure you understand every question.
Do the exercises at the end of readings from both books
Do the exam review
Please come to class and office hours with questions! There are no stupid questions!

Questions from last semester

Pivoting data

Suppose we have the following patient data:

patients

# A tibble: 3 × 4
  patient_id pulse_1 pulse_2 pulse_3
  <chr>        <dbl>   <dbl>   <dbl>
1 XYZ             70      85      73
2 ABC             90      95     102
3 DEF            100      80      70

And we want to know:

Average pulse rate for each patient.
Trends in pulse rates across measurements.

These require a longer format of the data where all pulse rates are in a single column and another column identifies the measurement number.

Pivoting data

patients_longer <- patients |>
  pivot_longer(
    cols = !patient_id,
    names_to = "measurement",
    values_to = "pulse_rate"
  )
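
To try the pivot end to end, here is a self-contained sketch that rebuilds the `patients` tibble from the values shown earlier. The `names_prefix` and `names_transform` arguments are additions for illustration, not part of the call above: they strip the `pulse_` prefix and store the measurement number as an integer, which may make it easier to plot or sort.

```r
library(tibble)
library(tidyr)

# Recreate the patients data from the slide
patients <- tribble(
  ~patient_id, ~pulse_1, ~pulse_2, ~pulse_3,
  "XYZ",        70,       85,       73,
  "ABC",        90,       95,      102,
  "DEF",       100,       80,       70
)

patients_longer <- patients |>
  pivot_longer(
    cols = !patient_id,                                # every column except the ID
    names_to = "measurement",
    names_prefix = "pulse_",                           # "pulse_1" becomes "1"
    names_transform = list(measurement = as.integer),  # "1" becomes 1L
    values_to = "pulse_rate"
  )

patients_longer
```

The result has one row per patient per measurement, so 3 patients with 3 measurements each give 9 rows.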

Summarizing pivoted data

patients_longer |>
  group_by(patient_id) |>
  summarize(mean_pulse = mean(pulse_rate))

# A tibble: 3 × 2
  patient_id mean_pulse
  <chr>           <dbl>
1 ABC              95.7
2 DEF              83.3
3 XYZ              76

Visualizing pivoted data

ggplot(
  patients_longer, 
  aes(x = measurement, y = pulse_rate, group = patient_id, color = patient_id)
  ) +
  geom_line()

Types and classes

Type is how an object is stored in memory, e.g.,
- double: a real number stored in double-precision floating-point format.
- integer: an integer (positive or negative)
Class is metadata about the object that can determine how common functions operate on that object, e.g.,
- factor
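
As a rough illustration of the distinction, the sketch below compares `typeof()` and `class()` on a few objects, then shows one way a class can change what a common function does. The object names (`x_dbl`, `x_int`, `f`) are made up for this example.

```r
# The same value stored as two different types
x_dbl <- 3    # numbers are doubles by default
x_int <- 3L   # the L suffix asks for an integer

typeof(x_dbl)  # "double"
typeof(x_int)  # "integer"

# A factor is stored as integers, but its class is "factor"
f <- factor(c("low", "high", "low"))
typeof(f)      # "integer"
class(f)       # "factor"

# summary() behaves differently depending on the class
factor_summary <- summary(f)              # counts per level
code_summary   <- summary(as.integer(f))  # numeric summary of the integer codes
```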

Types of vectors

You’ll commonly encounter:

logical
integer
double
character

You’ll less commonly encounter:

list
NULL
complex
raw
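
One way to see these types for yourself is to call `typeof()` on a small example of each. The list name `examples` is just for illustration.

```r
# One example value per vector type
examples <- list(
  logical   = TRUE,
  integer   = 1L,
  double    = 1.5,
  character = "a",
  list      = list(1, "a"),
  complex   = 1 + 2i,
  raw       = as.raw(255)
)

# Apply typeof() to each element; each result matches the element's name
types <- vapply(examples, typeof, character(1))
types

typeof(NULL)  # "NULL"
```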

Types of functions

Yes, functions have types too, but you don’t need to worry about the differences in the context of doing data science.

typeof(mean) # regular function

[1] "closure"

typeof(`$`) # internal function

[1] "special"

typeof(sum) # primitive function

[1] "builtin"

Factors

A factor is a vector that can contain only predefined values. It is used to store categorical data.

x <- factor(c("a", "b", "b", "a"))
x

[1] a b b a
Levels: a b

typeof(x)

[1] "integer"

attributes(x)

$levels
[1] "a" "b"

$class
[1] "factor"
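
Two consequences of this integer-plus-levels storage are worth trying out. The sketch below uses made-up objects `y` and `f` and wraps the invalid assignment in `suppressWarnings()` only to keep the output quiet; without it, R issues a warning about an invalid factor level.

```r
x <- factor(c("a", "b", "b", "a"))

# The underlying integer codes index into the levels
codes <- as.integer(x)  # 1 2 2 1

# Assigning a value that is not a predefined level produces NA
y <- x
suppressWarnings(y[2] <- "c")
is.na(y[2])             # TRUE

# Pitfall: converting a factor of number-like labels to numeric
f <- factor(c("10", "20", "10"))
as.numeric(f)                # 1 2 1 (the level codes, not the labels)
as.numeric(as.character(f))  # 10 20 10 (the labels, converted to numbers)
```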

Other classes

Just a couple of examples…

Date:

today <- Sys.Date()
today

[1] "2024-05-28"

typeof(today)

[1] "double"

attributes(today)

$class
[1] "Date"

Date-time:

now <- as.POSIXct("2024-02-08 11:45", tz = "EST")
now

[1] "2024-02-08 11:45:00 EST"

typeof(now)

[1] "double"

attributes(now)

$class
[1] "POSIXct" "POSIXt" 

$tzone
[1] "EST"
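
To see what the underlying double represents, the sketch below strips the class from a date and a date-time close to the Unix epoch. Dates count days since 1970-01-01 and POSIXct date-times count seconds since 1970-01-01 00:00:00 UTC. The object names `d` and `dt` are illustrative.

```r
# A Date is a number of days since 1970-01-01
d <- as.Date("1970-01-10")
unclass(d)      # 9

# A POSIXct date-time is a number of seconds since 1970-01-01 00:00:00 UTC
dt <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
as.numeric(dt)  # 60

# Because of the class, arithmetic works in sensible units
gap <- as.Date("2024-05-28") - as.Date("2024-05-21")
gap                       # Time difference of 7 days
as.Date("2024-05-28") + 7 # "2024-06-04"
```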

Application exercise

`ae-07-population-types`

Go to the project navigator in RStudio (top right corner of your RStudio window) and open the project called ae.
If there are any uncommitted files, commit them, and then click Pull.
Open the file called ae-07-population-types.qmd and render it.