Data visualization overview

Lecture 4

2024-05-21

Announcements

  • Lab 1 due Thursday evening at 11:59 PM on Gradescope.

  • Recall that we have a no-late-work policy! Please don’t wait until 11:55 PM to try to submit your work to Gradescope.

  • AEs this week should be submitted by midnight on Sunday. To “submit”, commit and push at least once to your ae repo for each application exercise this week.

  • 📘 ms - chp 5 is particularly useful for Lab 1.

  • AE 2 solutions posted.

Questions

  • How detailed should my notes be when doing the preparation reading?

  • How to fix: code/text cut off when rendering to PDF.

  • Useful function: View().

Visualizing various types of data

Identifying variable types

Identify the type of each of the following variables.

  • Favorite food
  • Number of classes you’re taking this semester
  • Zip code
  • Age
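A minimal sketch of how variables like these might be stored in R. The `survey` tibble, its column names, and its values are all made up for illustration. One thing it tries to show: a zip code looks numeric, but it is a label rather than a quantity, so storing it as text keeps leading zeros intact.

```r
library(tibble)

# Hypothetical survey responses, invented for illustration only
survey <- tibble(
  favorite_food = c("pizza", "sushi", "tacos"),
  n_classes     = c(4L, 5L, 4L),
  zip_code      = c("27708", "27701", "08540"),
  age           = c(19, 20, 21.5)
)

# Storage type of each column
sapply(survey, class)

# Converting a zip code to a number silently drops the leading zero
as.numeric("08540")
```

How a column is stored in R is not always the same as its statistical type, so it is worth checking both.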

The way data is displayed matters

What do these three plots show?

Visualizing penguins

library(tidyverse)
library(palmerpenguins)
library(ggthemes)

penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Univariate analysis

Analyzing a single variable:

  • Numerical: histogram, box plot, density plot, etc.

  • Categorical: bar plot, pie chart, etc.
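The numerical plots are built step by step in the slides that follow. For the categorical case, here is a minimal sketch of a bar plot of `species`, following the same pattern. The object name `p_bar` is just a placeholder.

```r
library(ggplot2)
library(palmerpenguins)

# Bar plot: geom_bar() counts the rows in each category for us,
# so only the x aesthetic is needed
p_bar <- ggplot(
  penguins,
  aes(x = species)
  ) +
  geom_bar() +
  labs(
    x = "Species",
    y = "Count"
  )

p_bar
```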

Histogram - Step 1

ggplot(
  penguins
  )

Histogram - Step 2

ggplot(
  penguins,
  aes(x = body_mass_g)
  )

Histogram - Step 3

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  # default is 30 bins; ggplot2 prints a message suggesting you pick a binwidth
  geom_histogram()

Histogram - Step 4

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_histogram(
    binwidth = 250
  )

Histogram - Step 5

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_histogram(
    binwidth = 250
  ) +
  labs(
    title = "Weights of penguins",
    x = "Weight (grams)",
    y = "Count"
  )

Boxplot - Step 1

ggplot(
  penguins
  )

Boxplot - Step 2

ggplot(
  penguins,
  aes(x = body_mass_g)
  )

Boxplot - Step 3

ggplot(
  penguins,
  # mapping the variable to y draws a vertical box plot
  aes(y = body_mass_g)
  ) +
  geom_boxplot()

Boxplot - Step 4

ggplot(
  penguins,
  # mapping the variable to x draws a horizontal box plot
  aes(x = body_mass_g)
  ) +
  geom_boxplot()

Boxplot - Step 5

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_boxplot() +
  labs(
    x = "Weight (grams)",
    y = NULL
  )

Density plot - Step 1

ggplot(
  penguins
  )

Density plot - Step 2

ggplot(
  penguins,
  aes(x = body_mass_g)
  )

Density plot - Step 3

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_density()

Density plot - Step 4

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_density(
    fill = "darkslategray1"
  )

Density plot - Step 5

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_density(
    fill = "darkslategray1",
    linewidth = 2
  )

Density plot - Step 6

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_density(
    fill = "darkslategray1",
    linewidth = 2,
    color = "darkorchid3"
  )

Density plot - Step 7

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_density(
    fill = "darkslategray1",
    linewidth = 2,
    color = "darkorchid3",
    alpha = 0.5
  )

Weights of penguins

TRUE / FALSE

  • The distribution of penguin weights in this sample is left skewed.
  • The distribution of penguin weights in this sample is unimodal.
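A rough numerical check to go with the plot, offered as a sketch rather than a rule: in a right-skewed distribution the mean tends to sit above the median, and in a left-skewed distribution it tends to sit below. The name `body_mass_summary` is a placeholder. Modality is easier to judge from the histogram or density plot itself.

```r
library(dplyr)
library(palmerpenguins)

# Compare the mean and median body mass, ignoring missing values
body_mass_summary <- penguins |>
  summarize(
    mean_mass   = mean(body_mass_g, na.rm = TRUE),
    median_mass = median(body_mass_g, na.rm = TRUE)
  )

body_mass_summary
```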

Bivariate analysis

Analyzing the relationship between two variables:

  • Numerical + numerical: scatterplot

  • Numerical + categorical: side-by-side box plots, violin plots, etc.

  • Categorical + categorical: stacked bar plots

  • Using an aesthetic (e.g., fill, color, shape, etc.) or facets to represent the second variable in any plot
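The slides that follow focus on box plots, density plots, and violin plots. For the other options in the list, here is a minimal sketch using the same penguins data. The choice of variables and the object names (`p_scatter`, `p_stacked`, `p_facets`) are illustrative assumptions.

```r
library(ggplot2)
library(palmerpenguins)

# Numerical + numerical: scatterplot
p_scatter <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

# Categorical + categorical: stacked bar plot
# (bars for each island are split by species)
p_stacked <- ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar()

# Facets for the second variable: one histogram panel per species
p_facets <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 250) +
  facet_wrap(~species)

p_scatter
p_stacked
p_facets
```

Rows with missing measurements are dropped with a warning when the scatterplot and histograms are drawn.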

Side-by-side box plots

ggplot(
  penguins,
  aes(
    x = body_mass_g,
    y = species
    )
  ) +
  geom_boxplot()

Density plots

ggplot(
  penguins,
  aes(
    x = body_mass_g,
    color = species
    )
  ) +
  geom_density()

Density plots

ggplot(
  penguins,
  aes(
    x = body_mass_g,
    color = species,
    fill = species
    )
  ) +
  geom_density()

Density plots

ggplot(
  penguins,
  aes(
    x = body_mass_g,
    color = species,
    fill = species
    )
  ) +
  geom_density(
    alpha = 0.5
  )

Density plots

ggplot(
  penguins,
  aes(
    x = body_mass_g,
    color = species,
    fill = species
    )
  ) +
  geom_density(
    alpha = 0.5
  ) +
  theme(
    legend.position = "bottom"
  )

Questions from Lab 1

Many of the questions in Lab 1 are subjective. How does that work?

identify at least one outlier

Questions from Lab 1

Many of the questions in Lab 1 are subjective. How does that work?

identify at least one outlier ✅

Questions from Lab 1

Many of the questions in Lab 1 are subjective. How does that work?

identify at least one outlier ❌

Mid-class recap

Packages

library(palmerpenguins)
library(tidyverse)
library(ggthemes)

Bivariate analysis

# Side-by-side box plots
ggplot(penguins, aes(x = body_mass_g, y = species, fill = species)) +
  geom_boxplot(alpha = 0.5, show.legend = FALSE) +
  scale_fill_colorblind() +
  labs(
    x = "Body mass (grams)", y = "Species",
    title = "Side-by-side box plots"
  )
# Density plots
ggplot(penguins, aes(x = body_mass_g, fill = species)) +
  geom_density(alpha = 0.5) +
  theme(legend.position = "bottom") +
  scale_fill_colorblind() +
  labs(
    x = "Body mass (grams)", y = "Density",
    fill = "Species", title = "Density plots"
  )

Violin plots

ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g
    )
  ) +
  geom_violin()

Multiple geoms

ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g
    )
  ) +
  geom_violin() +
  # points line up in a single column per species, so many overlap
  geom_point()

Multiple geoms

ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g
    )
  ) +
  geom_violin() +
  # adds a small amount of random noise so overlapping points become visible
  geom_jitter()

Multiple geoms + aesthetics

ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g,
    color = species
    )
  ) +
  geom_violin() +
  geom_jitter()

Multiple geoms + aesthetics

ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g,
    color = species
    )
  ) +
  geom_violin() +
  geom_jitter() +
  theme(
    legend.position = "none"
  )

Multiple geoms + aesthetics

ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g,
    color = species
    )
  ) +
  geom_violin() +
  geom_jitter() +
  theme(
    legend.position = "none"
  ) +
  scale_color_colorblind()

Multivariate analysis

Bechdel

Load the Bechdel test data with read_csv():

bechdel <- read_csv("https://sta199-s24.github.io/data/bechdel.csv")


View the column names() of the bechdel data:

names(bechdel)
[1] "title"       "year"        "gross_2013"  "budget_2013" "roi"        
[6] "binary"      "clean_test" 

ROI by test result

What about this plot makes it difficult to evaluate how ROI varies by Bechdel test result?

ggplot(bechdel, aes(x = roi, y = clean_test, color = binary)) +
  geom_boxplot()

Movies with high ROI

Which movies have the highest ROI?

# keep movies with ROI above 400, then show only the columns of interest
bechdel |>
  filter(roi > 400) |>
  select(title, roi, budget_2013, gross_2013, year, clean_test)
# A tibble: 3 × 6
  title                     roi budget_2013 gross_2013  year clean_test
  <chr>                   <dbl>       <dbl>      <dbl> <dbl> <chr>     
1 Paranormal Activity      671.      505595  339424558  2007 dubious   
2 The Blair Witch Project  648.      839077  543776715  1999 ok        
3 El Mariachi              583.       11622    6778946  1992 nowomen   

ROI by test result

Zoom in: What about this plot makes it difficult to evaluate how ROI varies by Bechdel test result?

ggplot(bechdel, aes(x = roi, y = clean_test, color = binary)) +
  geom_boxplot() +
  # zoom in on ROI between 0 and 15 without dropping any movies
  coord_cartesian(xlim = c(0, 15))
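Why coord_cartesian() rather than xlim()? coord_cartesian() only changes the visible window, so the box plot statistics are still computed from every observation. Setting limits on the scale removes out-of-range values before the statistics are computed. A small sketch with made-up data (the `toy` tibble is hypothetical and is not the Bechdel data):

```r
library(ggplot2)
library(tibble)

# Made-up values with one extreme observation
toy <- tibble(group = "a", value = c(1, 2, 3, 4, 100))

# Zooming: all five values are used to compute the box
p_zoom <- ggplot(toy, aes(x = value, y = group)) +
  geom_boxplot() +
  coord_cartesian(xlim = c(0, 15))

# Scale limits: 100 is removed (with a warning) before the box is computed
p_clip <- ggplot(toy, aes(x = value, y = group)) +
  geom_boxplot() +
  xlim(0, 15)

# Median line drawn in each plot
layer_data(p_zoom)$xmiddle  # median of all five values
layer_data(p_clip)$xmiddle  # median of the four values that remain
```

The two medians differ, which is the reason to prefer zooming when the goal is only to see the bulk of the data more clearly.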

Application exercise

ae-03-duke-forest

Go to the project navigator in RStudio (top right corner of your RStudio window) and open the project called ae. If there are any uncommitted files, commit them, and then click Pull.


When you’re finished, don’t forget to stage changes, commit with a message, and push to GitHub.