library(tidyverse)
library(nycflights13)
AE 04: NYC flights + data wrangling
Suggested answers
Important
These are suggested answers. This document should be used as reference only, it’s not designed to be an exhaustive key.
Exercise 1
Your turn: Fill in the blanks:
The flights
data frame has 336776 rows. Each row represents a _flight_.
Exercise 2
Your turn: What are the names of the variables in flights
.
names(flights)
[1] "year" "month" "day" "dep_time"
[5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
[9] "arr_delay" "carrier" "flight" "tailnum"
[13] "origin" "dest" "air_time" "distance"
[17] "hour" "minute" "time_hour"
Exercise 3 - select()
- Demo: Make a data frame that only contains the variables
dep_delay
andarr_delay
.
|>
flights select(dep_delay, arr_delay)
# A tibble: 336,776 × 2
dep_delay arr_delay
<dbl> <dbl>
1 2 11
2 4 20
3 2 33
4 -1 -18
5 -6 -25
6 -4 12
7 -5 19
8 -3 -14
9 -3 -8
10 -2 8
# ℹ 336,766 more rows
- Demo: Make a data frame that keeps every variable except
dep_delay
.
|>
flights select(-dep_delay)
# A tibble: 336,776 × 18
year month day dep_time sched_dep_time arr_time sched_arr_time arr_delay
<int> <int> <int> <int> <int> <int> <int> <dbl>
1 2013 1 1 517 515 830 819 11
2 2013 1 1 533 529 850 830 20
3 2013 1 1 542 540 923 850 33
4 2013 1 1 544 545 1004 1022 -18
5 2013 1 1 554 600 812 837 -25
6 2013 1 1 554 558 740 728 12
7 2013 1 1 555 600 913 854 19
8 2013 1 1 557 600 709 723 -14
9 2013 1 1 557 600 838 846 -8
10 2013 1 1 558 600 753 745 8
# ℹ 336,766 more rows
# ℹ 10 more variables: carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
- Demo: Make a data frame that includes all variables between
year
throughdep_delay
(inclusive). These are all variables that provide information about the departure of each flight.
|>
flights select(year:dep_delay)
# A tibble: 336,776 × 6
year month day dep_time sched_dep_time dep_delay
<int> <int> <int> <int> <int> <dbl>
1 2013 1 1 517 515 2
2 2013 1 1 533 529 4
3 2013 1 1 542 540 2
4 2013 1 1 544 545 -1
5 2013 1 1 554 600 -6
6 2013 1 1 554 558 -4
7 2013 1 1 555 600 -5
8 2013 1 1 557 600 -3
9 2013 1 1 557 600 -3
10 2013 1 1 558 600 -2
# ℹ 336,766 more rows
- Demo: Use the
select
helpercontains()
to make a data frame that includes the variables associated with the arrival, i.e., contains the string"arr\_"
in the name.
|>
flights select(contains("arr_"))
# A tibble: 336,776 × 3
arr_time sched_arr_time arr_delay
<int> <int> <dbl>
1 830 819 11
2 850 830 20
3 923 850 33
4 1004 1022 -18
5 812 837 -25
6 740 728 12
7 913 854 19
8 709 723 -14
9 838 846 -8
10 753 745 8
# ℹ 336,766 more rows
Exercise 4 - slice()
- Demo: Display the first five rows of the
flights
data frame.
|>
flights slice(1:5)
# A tibble: 5 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
- Demo: Display the last two rows of the
flights
data frame.
|>
flights slice((n()-1):n())
# A tibble: 2 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 9 30 NA 1159 NA NA 1344
2 2013 9 30 NA 840 NA NA 1020
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Exercise 5 - arrange()
- Demo: Let’s arrange the data by departure delay, so the flights with the shortest departure delays will be at the top of the data frame.
|>
flights arrange(dep_delay)
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 12 7 2040 2123 -43 40 2352
2 2013 2 3 2022 2055 -33 2240 2338
3 2013 11 10 1408 1440 -32 1549 1559
4 2013 1 11 1900 1930 -30 2233 2243
5 2013 1 29 1703 1730 -27 1947 1957
6 2013 8 9 729 755 -26 1002 955
7 2013 10 23 1907 1932 -25 2143 2143
8 2013 3 30 2030 2055 -25 2213 2250
9 2013 3 2 1431 1455 -24 1601 1631
10 2013 5 5 934 958 -24 1225 1309
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
- Question: What does it mean for the
dep_delay
to have a negative value?
Arrived early.
- Demo: Arrange the data by descending departure delay, so the flights with the longest departure delays will be at the top.
|>
flights arrange(desc(dep_delay))
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 9 641 900 1301 1242 1530
2 2013 6 15 1432 1935 1137 1607 2120
3 2013 1 10 1121 1635 1126 1239 1810
4 2013 9 20 1139 1845 1014 1457 2210
5 2013 7 22 845 1600 1005 1044 1815
6 2013 4 10 1100 1900 960 1342 2211
7 2013 3 17 2321 810 911 135 1020
8 2013 6 27 959 1900 899 1236 2226
9 2013 7 22 2257 759 898 121 1026
10 2013 12 5 756 1700 896 1058 2020
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
- Your turn: Create a data frame that only includes the plane tail number (
tailnum
), carrier (carrier
), and departure delay for the flight with the longest departure delay. What is the plane tail number (tailnum
) for this flight?
|>
flights select(tailnum, carrier, dep_delay) %>%
arrange(dep_delay) |>
slice(1)
# A tibble: 1 × 3
tailnum carrier dep_delay
<chr> <chr> <dbl>
1 N592JB B6 -43
Exercise 6 - filter()
- Demo: Filter the data frame by selecting the rows where the destination airport is RDU.
|>
flights filter(dest == "RDU")
# A tibble: 8,163 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 800 810 -10 949 955
2 2013 1 1 832 840 -8 1006 1030
3 2013 1 1 851 851 0 1032 1036
4 2013 1 1 917 920 -3 1052 1108
5 2013 1 1 1024 1030 -6 1204 1215
6 2013 1 1 1127 1129 -2 1303 1309
7 2013 1 1 1157 1205 -8 1342 1345
8 2013 1 1 1240 1235 5 1415 1415
9 2013 1 1 1317 1325 -8 1454 1505
10 2013 1 1 1449 1450 -1 1651 1640
# ℹ 8,153 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
- Demo: We can also filter using more than one condition. Here we select all rows where the destination airport is RDU and the arrival delay is less than 0.
|>
flights filter(dest == "RDU", arr_delay < 0)
# A tibble: 4,232 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 800 810 -10 949 955
2 2013 1 1 832 840 -8 1006 1030
3 2013 1 1 851 851 0 1032 1036
4 2013 1 1 917 920 -3 1052 1108
5 2013 1 1 1024 1030 -6 1204 1215
6 2013 1 1 1127 1129 -2 1303 1309
7 2013 1 1 1157 1205 -8 1342 1345
8 2013 1 1 1317 1325 -8 1454 1505
9 2013 1 1 1505 1510 -5 1654 1655
10 2013 1 1 1800 1800 0 1945 1951
# ℹ 4,222 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
- Your turn: Describe what the code is doing in words.
|>
flights filter(
%in% c("RDU", "GSO"),
dest < 0 | dep_delay < 0
arr_delay )
# A tibble: 6,203 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 800 810 -10 949 955
2 2013 1 1 832 840 -8 1006 1030
3 2013 1 1 851 851 0 1032 1036
4 2013 1 1 917 920 -3 1052 1108
5 2013 1 1 1024 1030 -6 1204 1215
6 2013 1 1 1127 1129 -2 1303 1309
7 2013 1 1 1157 1205 -8 1342 1345
8 2013 1 1 1317 1325 -8 1454 1505
9 2013 1 1 1449 1450 -1 1651 1640
10 2013 1 1 1505 1510 -5 1654 1655
# ℹ 6,193 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Hint: Logical operators in R:
operator | definition |
---|---|
< |
is less than? |
<= |
is less than or equal to? |
> |
is greater than? |
>= |
is greater than or equal to? |
== |
is exactly equal to? |
!= |
is not equal to? |
x & y |
is x AND y? |
x | y |
is x OR y? |
is.na(x) |
is x NA? |
!is.na(x) |
is x not NA? |
x %in% y |
is x in y? |
!(x %in% y) |
is x not in y? |
!x |
is not x? (only makes sense if x is TRUE or FALSE ) |
Exercise 7 - count()
- Demo: Create a frequency table of the destination locations for flights from New York.
|>
flights count(dest)
# A tibble: 105 × 2
dest n
<chr> <int>
1 ABQ 254
2 ACK 265
3 ALB 439
4 ANC 8
5 ATL 17215
6 AUS 2439
7 AVL 275
8 BDL 443
9 BGR 375
10 BHM 297
# ℹ 95 more rows
- Demo: In which month was there the fewest number of flights? How many flights were there in that month?
|>
flights count(month) |>
filter(n == min(n))
# A tibble: 1 × 2
month n
<int> <int>
1 2 24951
- Your turn: On which date (month + day) was there the largest number of flights? How many flights were there on that day?
|>
flights count(month, day) |>
filter(n == max(n))
# A tibble: 1 × 3
month day n
<int> <int> <int>
1 11 27 1014
Exercise 8 - mutate()
- Demo: Convert
air_time
(minutes in the air) to hours and then create a new variable,mph
, the miles per hour of the flight.
|>
flights mutate(
hours = air_time / 60,
mph = distance / hours
|>
) select(air_time, distance, hours, mph)
# A tibble: 336,776 × 4
air_time distance hours mph
<dbl> <dbl> <dbl> <dbl>
1 227 1400 3.78 370.
2 227 1416 3.78 374.
3 160 1089 2.67 408.
4 183 1576 3.05 517.
5 116 762 1.93 394.
6 150 719 2.5 288.
7 158 1065 2.63 404.
8 53 229 0.883 259.
9 140 944 2.33 405.
10 138 733 2.3 319.
# ℹ 336,766 more rows
- Your turn: Create a new variable to calculate the percentage of flights in each month. What percentage of flights take place in July?
|>
flights count(month) |>
mutate(perc = n / sum(n) * 100)
# A tibble: 12 × 3
month n perc
<int> <int> <dbl>
1 1 27004 8.02
2 2 24951 7.41
3 3 28834 8.56
4 4 28330 8.41
5 5 28796 8.55
6 6 28243 8.39
7 7 29425 8.74
8 8 29327 8.71
9 9 27574 8.19
10 10 28889 8.58
11 11 27268 8.10
12 12 28135 8.35
- Demo: Create a new variable,
rdu_bound
, which indicates whether the flight is to RDU or not. Then, for each departure airport (origin
), calculate what proportion of flights originating from that airport are to RDU.
|>
flights mutate(rdu_bound = if_else(dest == "RDU", "Yes", "No")) |>
count(origin, rdu_bound) |>
group_by(origin) |>
mutate(prop = n / sum(n)) |>
filter(rdu_bound == "Yes")
# A tibble: 3 × 4
# Groups: origin [3]
origin rdu_bound n prop
<chr> <chr> <int> <dbl>
1 EWR Yes 1482 0.0123
2 JFK Yes 3100 0.0279
3 LGA Yes 3581 0.0342
Exercise 9 - summarize()
- Demo: Find mean arrival delay for all flights.
|>
flights summarize(mean_dep_delay = mean(dep_delay))
# A tibble: 1 × 1
mean_dep_delay
<dbl>
1 NA
Exercise 10 - group_by()
- Demo: Find mean arrival delay for for each month.
|>
flights group_by(month) |>
summarize(mean_arr_delay = mean(arr_delay, na.rm = TRUE))
# A tibble: 12 × 2
month mean_arr_delay
<int> <dbl>
1 1 6.13
2 2 5.61
3 3 5.81
4 4 11.2
5 5 3.52
6 6 16.5
7 7 16.7
8 8 6.04
9 9 -4.02
10 10 -0.167
11 11 0.461
12 12 14.9
- Your turn: What is the median departure delay for each airports around NYC (
origin
)? Which airport has the shortest median departure delay?
|>
flights group_by(origin) |>
summarize(med_dep_delay = median(dep_delay, na.rm = TRUE))
# A tibble: 3 × 2
origin med_dep_delay
<chr> <dbl>
1 EWR -1
2 JFK -1
3 LGA -3
Additional Practice
- Create a new dataset that only contains flights that do not have a missing departure time. Include the columns
year
,month
,day
,dep_time
,dep_delay
, anddep_delay_hours
(the departure delay in hours). Hint: Note you may need to usemutate()
to make one or more of these variables.
# add code here
- For each airplane (uniquely identified by
tailnum
), use agroup_by()
paired withsummarize()
to find the sample size, mean, and standard deviation of flight distances. Then include only the top 5 and bottom 5 airplanes in terms of mean distance traveled per flight in the final data frame.
# add code here