library(tidyverse)
library(readxl)
AE 08: Data import
Suggested answers
Important
These are suggested answers. This document should be used as reference only, it’s not designed to be an exhaustive key.
Packages
We will use the following two packages in this application exercise.
- tidyverse: For data import, wrangling, and visualization.
- readxl: For importing data from Excel.
Part 1: Hollywood relationships
- Demo: Load the data from https://sta199-s24.github.io/data/age_gaps.csv and assign it to
age_gaps
. Confirm that this new object appears in your Environment tab.
<- read_csv("https://sta199-s24.github.io/data/age_gaps.csv") age_gaps
Rows: 1155 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): movie_name, director, actor_1_name, actor_2_name, character_1_gend...
dbl (5): release_year, age_difference, couple_number, actor_1_age, actor_2_age
date (2): actor_1_birthdate, actor_2_birthdate
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
- Your turn (5 minutes): Split the data into three – where woman is older, where man is older, where they are the same age. Save these subsets as two appropriately named data frames. Remember: Use concise and evocative names. Confirm that these new objects appear in your Environment tab and that the sum of the number of observations in the two new data frames add to the number of observations in the original data frame.
<- age_gaps |>
age_gaps mutate(older = case_when(
== "woman" & actor_1_age > actor_2_age ~ "woman older",
character_1_gender == "woman" & actor_2_age > actor_1_age ~ "woman older",
character_2_gender == "man" & actor_1_age > actor_2_age ~ "man older",
character_1_gender == "man" & actor_2_age > actor_1_age ~ "man older",
character_2_gender == actor_2_age ~ "same age"
actor_1_age
)
)
<- age_gaps |> filter(older == "woman older")
woman_older <- age_gaps |> filter(older == "man older")
man_older <- age_gaps |> filter(older == "same age")
same_age
nrow(woman_older) + nrow(man_older) + nrow(same_age)) == nrow(age_gaps) (
[1] TRUE
- Demo: Write out the three new datasets you created into the
data
folder:
write_csv(woman_older, file = "data/woman_older.csv")
write_csv(man_older, file = "data/man_older.csv")
write_csv(same_age, file = "data/same_age.csv")
Part 2: Sales
Sales data are stored in an Excel file that looks like the following:
- Demo: Read in the Excel file called
sales.xlsx
from thedata-raw/
folder such that it looks like the following.
<- read_excel(
sales_raw "data/sales.xlsx",
skip = 3,
col_names = c("id", "n")
)
- Demo - Stretch goal: Manipulate the sales data such such that it looks like the following.
<- sales_raw |>
sales mutate(
is_brand_name = str_detect(id, "Brand"),
brand = if_else(is_brand_name, id, NA)
|>
) fill(brand) |>
filter(!is_brand_name) |>
select(brand, id, n)
sales
# A tibble: 7 × 3
brand id n
<chr> <chr> <chr>
1 Brand 1 1234 8
2 Brand 1 8721 2
3 Brand 1 1822 3
4 Brand 2 3333 1
5 Brand 2 2156 3
6 Brand 2 3987 6
7 Brand 2 3216 5
- Question: Why should we bother with writing code for reading the data in by skipping columns and assigning variable names as well as cleaning it up in multiple steps instead of opening the Excel file and editing the data in there to prepare it for a clean import?
Because the code allows us to struggle once and re-use for future datasets and leaves a transparent trail of our modifications while manipulating the data in Excel directly is neither reproducible nor reusable.