# A tibble: 1 × 1
median_roi
<dbl>
1 3.91
Lecture 5
2024-05-21
to today’s data wrangling pipelines…
What does this plot say about return-on-investment on movies that pass the Bechdel test?
mutate()
function in Lab 1 - Q7”)Avoid long lines of code.
Label code chunks.
label
s.Render (i.e. re-create the PDF), stage, commit, and push regularly.
tibble
sTo demonstrate data wrangling we will use flights
, a tibble in the nycflights13 R package.
The data set includes characteristics of all flights departing from New York City (JFK, LGA, EWR) in 2013.
flights
– dimensionsflights
– first lookRows: 336,776
Columns: 19
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
# A tibble: 6 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
tibble
vs. data frame, or, FYIR
data framedplyr
Primary package in the tidyverse for data wrangling
slice()
: chooses rows based on locationfilter()
:chooses rows based on column valuesarrange()
: changes the order of the rowssample_n()
: take a random subset of the rowsslice()
Display the first five rows of flights
:
flights
data frame, and then,
# A tibble: 5 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
filter()
Filter for flights to RDU in flights
:
flights
data frame, and then,
dest
is equal to RDU
# A tibble: 8,163 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 800 810 -10 949 955
2 2013 1 1 832 840 -8 1006 1030
3 2013 1 1 851 851 0 1032 1036
4 2013 1 1 917 920 -3 1052 1108
5 2013 1 1 1024 1030 -6 1204 1215
6 2013 1 1 1127 1129 -2 1303 1309
7 2013 1 1 1157 1205 -8 1342 1345
8 2013 1 1 1240 1235 5 1415 1415
9 2013 1 1 1317 1325 -8 1454 1505
10 2013 1 1 1449 1450 -1 1651 1640
# ℹ 8,153 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
The pipe, |>
, is an operator (a tool) for passing information from one process to another.
Use |>
mainly in data pipelines to pass the output of the previous line of code as the first input of the next line of code.
When reading code “in English”, say “and then” whenever you see a pipe.
Which airline is the most delayed, on average, arriving in RDU from a New York City airport?
Goal:
# A tibble: 5 × 2
carrier mean_arr_delay
<chr> <dbl>
1 UA 56
2 B6 13.3
3 EV 12.2
4 9E 9.67
5 MQ 8.82
1flights
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 8,163 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 800 810 -10 949 955
2 2013 1 1 832 840 -8 1006 1030
3 2013 1 1 851 851 0 1032 1036
4 2013 1 1 917 920 -3 1052 1108
5 2013 1 1 1024 1030 -6 1204 1215
6 2013 1 1 1127 1129 -2 1303 1309
7 2013 1 1 1157 1205 -8 1342 1345
8 2013 1 1 1240 1235 5 1415 1415
9 2013 1 1 1317 1325 -8 1454 1505
10 2013 1 1 1449 1450 -1 1651 1640
# ℹ 8,153 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 8,163 × 19
# Groups: carrier [5]
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 800 810 -10 949 955
2 2013 1 1 832 840 -8 1006 1030
3 2013 1 1 851 851 0 1032 1036
4 2013 1 1 917 920 -3 1052 1108
5 2013 1 1 1024 1030 -6 1204 1215
6 2013 1 1 1127 1129 -2 1303 1309
7 2013 1 1 1157 1205 -8 1342 1345
8 2013 1 1 1240 1235 5 1415 1415
9 2013 1 1 1317 1325 -8 1454 1505
10 2013 1 1 1449 1450 -1 1651 1640
# ℹ 8,153 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
1flights |>
2 filter(dest == "RDU") |>
3 group_by(carrier) |>
4 summarize(mean_arr_delay = mean(arr_delay, na.rm = TRUE))
# A tibble: 5 × 2
carrier mean_arr_delay
<chr> <dbl>
1 9E 9.67
2 B6 13.3
3 EV 12.2
4 MQ 8.82
5 UA 56
1flights |>
2 filter(dest == "RDU") |>
3 group_by(carrier) |>
4 summarize(mean_arr_delay = mean(arr_delay, na.rm = TRUE)) |>
5 arrange(desc(mean_arr_delay))
# A tibble: 5 × 2
carrier mean_arr_delay
<chr> <dbl>
1 UA 56
2 B6 13.3
3 EV 12.2
4 9E 9.67
5 MQ 8.82
select()
: changes whether or not a column is included.rename()
: changes the name of columns.mutate()
: changes the values of columns and creates new columns.summarize()
: collapses a group into a single row.count()
: count unique values of one or more variables.group_by()
: perform calculations separately for each value of a variableae-04-flights-wrangling
Go to the project navigator in RStudio (top right corner of your RStudio window) and open the project called ae
.
If there are any uncommitted files, commit them, and then click Pull.
Open the file called ae-04-flights-wrangling.qmd
and render it.
When you’re finished, don’t forget to render your PDF, stage changes, commit with a message, and push to Github.
+
, and indent the next line.operator | definition |
---|---|
< |
is less than? |
<= |
is less than or equal to? |
> |
is greater than? |
>= |
is greater than or equal to? |
== |
is exactly equal to? |
!= |
is not equal to? |
operator | definition |
---|---|
x & y |
is x AND y? |
x | y |
is x OR y? |
is.na(x) |
is x NA? |
!is.na(x) |
is x not NA? |
x %in% y |
is x in y? |
!(x %in% y) |
is x not in y? |
!x |
is not x? (only makes sense if x is TRUE or FALSE ) |