No late work is accepted. Always turn something in! You can use the code chunk option #| eval: FALSE or (worst case scenario) turn in the .qmd file if your PDF won’t render.
The AEs are very similar to labs. If you can’t recall how to do something, return to the AE!
Render after every question, and don’t go on to the next question if you aren’t able to render: fix the problem first.
Let’s go over Lab 2, Q6a - don’t overwrite the dataset!
# A tibble: 437 × 4
county state percbelowpoverty percollege
<chr> <chr> <dbl> <dbl>
1 ADAMS IL 13.2 19.6
2 ALEXANDER IL 32.2 11.2
3 BOND IL 12.1 17.0
4 BOONE IL 7.21 17.3
5 BROWN IL 13.5 14.5
6 BUREAU IL 10.4 18.9
7 CALHOUN IL 15.1 11.9
8 CARROLL IL 11.7 16.2
9 CASS IL 13.9 14.1
10 CHAMPAIGN IL 15.6 41.3
# ℹ 427 more rows
# A tibble: 437 × 28
county state percbelowpoverty percollege PID area poptotal popdensity
<chr> <chr> <dbl> <dbl> <int> <dbl> <int> <dbl>
1 ADAMS IL 13.2 19.6 561 0.052 66090 1271.
2 ALEXANDER IL 32.2 11.2 562 0.014 10626 759
3 BOND IL 12.1 17.0 563 0.022 14991 681.
4 BOONE IL 7.21 17.3 564 0.017 30806 1812.
5 BROWN IL 13.5 14.5 565 0.018 5836 324.
6 BUREAU IL 10.4 18.9 566 0.05 35688 714.
7 CALHOUN IL 15.1 11.9 567 0.017 5322 313.
8 CARROLL IL 11.7 16.2 568 0.027 16805 622.
9 CASS IL 13.9 14.1 569 0.024 13437 560.
10 CHAMPAIGN IL 15.6 41.3 570 0.058 173025 2983.
# ℹ 427 more rows
# ℹ 20 more variables: popwhite <int>, popblack <int>, popamerindian <int>,
# popasian <int>, popother <int>, percwhite <dbl>, percblack <dbl>,
# percamerindan <dbl>, percasian <dbl>, percother <dbl>, popadults <int>,
# perchsd <dbl>, percprof <dbl>, poppovertyknown <int>,
# percpovertyknown <dbl>, percchildbelowpovert <dbl>, percadultpoverty <dbl>,
# percelderlypoverty <dbl>, inmetro <int>, category <chr>
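The point about not overwriting the dataset can be sketched as follows. This is a minimal illustration, not the lab solution: it assumes the two printouts above come from the midwest data set in ggplot2 (the column names match), and midwest_poverty is a hypothetical name chosen for the example.

```r
library(dplyr)
library(ggplot2)  # assumed source of the midwest data set

# Printing the result of a pipeline leaves midwest untouched
midwest |>
  select(county, state, percbelowpoverty, percollege)

# Saving the result under a NEW (hypothetical) name also leaves midwest untouched
midwest_poverty <- midwest |>
  select(county, state, percbelowpoverty, percollege)

ncol(midwest)          # still 28
ncol(midwest_poverty)  # 4

# What to avoid: assigning back to the same name drops the other 24 columns
# for every later question in the document
# midwest <- midwest |> select(county, state, percbelowpoverty, percollege)
```

If a later question needs a column that is gone, the usual culprit is an earlier assignment back to the original name.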
Recap: pivoting
Data sets can’t be labeled as wide or long in absolute terms, but they can be made wider or longer to fit the format a particular analysis requires.
When pivoting longer, variable names that turn into values are characters by default. If you need them to be of another type, you must make that transformation explicitly, which you can do within the pivot_longer() function.
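As a small sketch of that transformation, the example below uses a made-up tibble (wide, with hypothetical year columns) and the names_transform argument of pivot_longer() to turn the former column names into integers. The data and object names are assumptions for illustration only.

```r
library(tidyr)
library(tibble)

# Hypothetical wide data: one column per year
wide <- tibble(
  country = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(11, 21)
)

# Default behavior: the old column names become a character column
long_default <- wide |>
  pivot_longer(cols = -country, names_to = "year", values_to = "value")
class(long_default$year)  # "character"

# Explicit transformation inside pivot_longer()
long_int <- wide |>
  pivot_longer(
    cols = -country,
    names_to = "year",
    values_to = "value",
    names_transform = list(year = as.integer)
  )
class(long_int$year)  # "integer"
```

The same result could be had with a mutate() after pivoting; doing it inside pivot_longer() keeps the reshaping and the type change in one step.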
You can tweak a plot forever, but at some point the tweaks are likely not very productive. However, you should always be critical of defaults (however pretty they might be) and see if you can improve the plot to better portray your data / results / what you want to communicate.
Joining datasets
Why join?
Suppose we want to answer questions like:
Is there a relationship between
- number of QS courses taken
- having scored a 4 or 5 on the AP stats exam
- motivation for taking course
- …
and performance in this course?
Each of these would require joining class performance data with an outside data source so we can have all relevant information (columns) in a single data frame.
Setup
For the next few slides…
x <- tibble(
  id = c(1, 2, 3),
  value_x = c("x1", "x2", "x3")
)
x
# A tibble: 3 × 2
id value_x
<dbl> <chr>
1 1 x1
2 2 x2
3 3 x3
y <- tibble(
  id = c(1, 2, 4),
  value_y = c("y1", "y2", "y4")
)
y
# A tibble: 3 × 2
id value_y
<dbl> <chr>
1 1 y1
2 2 y2
3 4 y4
inner_join()
inner_join(x, y)
# A tibble: 2 × 3
id value_x value_y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
semi_join()
semi_join(x, y)
# A tibble: 2 × 2
id value_x
<dbl> <chr>
1 1 x1
2 2 x2
anti_join()
anti_join(x, y)
# A tibble: 1 × 2
id value_x
<dbl> <chr>
1 3 x3
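The two filtering joins can be checked against each other. The sketch below rebuilds x and y from the setup and passes by = "id" explicitly (the slides rely on the default of joining by the shared column); in_both and only_x are names introduced here for illustration.

```r
library(dplyr)
library(tibble)

x <- tibble(id = c(1, 2, 3), value_x = c("x1", "x2", "x3"))
y <- tibble(id = c(1, 2, 4), value_y = c("y1", "y2", "y4"))

# Filtering joins keep only the columns of x
in_both <- semi_join(x, y, by = "id")  # rows of x with a match in y
only_x  <- anti_join(x, y, by = "id")  # rows of x without a match in y

# Together they account for every row of x
nrow(in_both) + nrow(only_x) == nrow(x)
```

Stating the key with by also silences the message dplyr prints when it has to guess the joining column.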
Example: Passenger capacity
nycflights13 & airport capacity
You’ve previously seen the flights data available in the nycflights13 package, which details all flights that departed from the three major NYC airports in 2013.
Today we would like to answer a specific question:
What was the passenger capacity (i.e., maximum number of passengers) that could have flown out of the three airports in 2013?
To answer this we will need to know how many passenger seats each plane had available. Each flight record has a tailnum, which is a unique identifier for the plane. This can be linked to the planes data set, which has the number of available seats for each plane.
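One possible way to carry out that link is sketched below. It is an assumption about the approach, not necessarily the one used in class: it uses left_join() so that flights whose tailnum is missing from planes are kept and counted, and the names flights_seats, capacity, and flights_without_seats are introduced here for illustration.

```r
library(dplyr)
library(nycflights13)

# Attach the seat count of each plane to each flight.
# flights and planes both have a year column with different meanings
# (flight year vs. year of manufacture), so the key is stated explicitly.
flights_seats <- flights |>
  select(origin, tailnum) |>
  left_join(select(planes, tailnum, seats), by = "tailnum")

# Capacity per airport; flights with no matching plane have seats = NA
capacity <- flights_seats |>
  group_by(origin) |>
  summarize(
    flights_without_seats = sum(is.na(seats)),
    total_seats = sum(seats, na.rm = TRUE)
  )
capacity

# Overall capacity across the three airports
sum(capacity$total_seats)
```

Because some flights have no seat information, the total is best read as a lower bound on capacity; the flights_without_seats column shows how many flights were left out of each sum.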