library(tidyverse) # for data wrangling and visualization
library(scales) # for pretty axis breaks
AE 06: Joining country populations with continents
Goal
Our ultimate goal in this application exercise is to create a bar plot of total populations of continents, where the input data are:
- Countries and populations
- Countries and continents
Data
Countries and populations
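These data come from The World Bank and reflect population counts as of 2022. Judging by the values shown below (for example, 41129 for Afghanistan), the counts appear to be recorded in thousands.
population <- read_csv("https://sta199-s24.github.io/data/world-pop-2022.csv")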
Let’s take a look at the data.
population
# A tibble: 217 × 3
country year population
<chr> <dbl> <dbl>
1 Afghanistan 2022 41129.
2 Albania 2022 2778.
3 Algeria 2022 44903.
4 American Samoa 2022 44.3
5 Andorra 2022 79.8
6 Angola 2022 35589.
7 Antigua and Barbuda 2022 93.8
8 Argentina 2022 46235.
9 Armenia 2022 2780.
10 Aruba 2022 106.
# ℹ 207 more rows
Continents
These data come from Our World in Data.
continents <- read_csv("https://sta199-s24.github.io/data/continents.csv")
Let’s take a look at the data.
continents
# A tibble: 285 × 4
entity code year continent
<chr> <chr> <dbl> <chr>
1 Abkhazia OWID_ABK 2015 Asia
2 Afghanistan AFG 2015 Asia
3 Akrotiri and Dhekelia OWID_AKD 2015 Asia
4 Aland Islands ALA 2015 Europe
5 Albania ALB 2015 Europe
6 Algeria DZA 2015 Africa
7 American Samoa ASM 2015 Oceania
8 Andorra AND 2015 Europe
9 Angola AGO 2015 Africa
10 Anguilla AIA 2015 North America
# ℹ 275 more rows
Exercises
Think out loud:
- Which variable(s) will we use to join the `population` and `continents` data frames?
Add response here.
- We want to create a new data frame that keeps all rows and columns from `population` and brings in the corresponding information from `continents`. Which join function should we use?
Add response here.
- Demo: Join the two data frames and assign the joined data frame to a new data frame named `population_continents`.
# add code here
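As a rough sketch of what such a join could look like, the example below uses two small made-up tibbles (`population_toy` and `continents_toy`, both hypothetical stand-ins, not the real data) and assumes the key columns are `country` and `entity`, as the printed tibbles above suggest. It also assumes dplyr 1.1.0 or later for `join_by()`.

```r
library(tidyverse)

# Hypothetical toy data; names and values are for illustration only
population_toy <- tribble(
  ~country,    ~population,
  "Albania",   10,
  "Algeria",   20,
  "Examplia",  30
)

continents_toy <- tribble(
  ~entity,   ~code, ~continent,
  "Albania", "ALB", "Europe",
  "Algeria", "DZA", "Africa",
  "Angola",  "AGO", "Africa"
)

# left_join() keeps every row of the left data frame (population_toy);
# rows without a match get NA in the columns brought in from the right
joined_toy <- population_toy |>
  left_join(continents_toy, by = join_by(country == entity))

joined_toy
```

With the real data, both data frames carry a `year` column, so a join on country name alone would keep both, renamed `year.x` and `year.y` by default.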
- Demo: Take a look at the newly created `population_continents` data frame. There are some countries that were not in `continents`. First, identify which countries these are (they will have `NA` values for `continent`).
# add code here
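One possible way to find the unmatched rows is to filter on missing values. The sketch below uses a made-up joined tibble (`joined_toy`, hypothetical) rather than the real data.

```r
library(tidyverse)

# Hypothetical result of a left join; "Examplia" found no match
joined_toy <- tribble(
  ~country,   ~population, ~continent,
  "Albania",  10,          "Europe",
  "Algeria",  20,          "Africa",
  "Examplia", 30,          NA
)

# Keep only the rows where the continent is missing
unmatched_toy <- joined_toy |>
  filter(is.na(continent))

unmatched_toy
```

An `anti_join()` of the two original data frames would be another way to list the same rows without joining first.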
- Demo: All of these countries are actually in the `continents` data frame, but under different names. So, let’s clean the data first by updating the country names in the `population` data frame so that they match the names in the `continents` data frame, using a `case_when()` statement in `mutate()`, and then join the data frames again. At the end, check that all countries now have continent information.
# add code here
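The recoding step might be sketched as follows. The mismatched name pair here ("Republic of Examplia" versus "Examplia") is invented for illustration; the actual pairs have to be read off the unmatched rows in the real data. The `.default` argument assumes dplyr 1.1.0 or later; on older versions `TRUE ~ country` plays the same role.

```r
library(tidyverse)

# Hypothetical toy data with one name that differs between sources
population_toy <- tribble(
  ~country,               ~population,
  "Albania",              10,
  "Republic of Examplia", 30
)

continents_toy <- tribble(
  ~entity,    ~continent,
  "Albania",  "Europe",
  "Examplia", "Asia"
)

population_fixed <- population_toy |>
  mutate(
    country = case_when(
      country == "Republic of Examplia" ~ "Examplia", # recode to match continents_toy
      .default = country                              # leave all other names unchanged
    )
  ) |>
  left_join(continents_toy, by = join_by(country == entity))

# Check: no rows should be left without a continent
population_fixed |>
  filter(is.na(continent))
```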
- Think out loud: Which continent do you think has the highest population? Which do you think has the second highest? The lowest?
Add your response here.
- Demo: Create a new data frame called `population_summary` that contains a row for each continent and a column for the total population for that continent, in descending order of population. Note that the function for calculating totals in R is `sum()`.
# add code here
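A grouped summary along these lines could look like the sketch below, which again uses made-up data. The column name `total_pop` is an assumption chosen for this example.

```r
library(tidyverse)

# Hypothetical joined data; values are for illustration only
joined_toy <- tribble(
  ~country, ~population, ~continent,
  "A",      10,          "Europe",
  "B",      20,          "Europe",
  "C",      50,          "Africa",
  "D",      5,           "Asia",
  "E",      15,          "Asia"
)

summary_toy <- joined_toy |>
  group_by(continent) |>                    # one group per continent
  summarize(total_pop = sum(population)) |> # total population within each group
  arrange(desc(total_pop))                  # largest total first

summary_toy
```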
- Your turn: Make a bar plot with total population on the y-axis and continent on the x-axis, where the height of each bar represents the total population in that continent.
# add code here
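Since the totals are already computed, `geom_col()` is a natural fit: it draws bars whose heights come straight from a column, whereas `geom_bar()` counts rows by default. A minimal sketch with a made-up summary tibble (`summary_toy` and `total_pop` are hypothetical names):

```r
library(tidyverse)

# Hypothetical continent totals; values are for illustration only
summary_toy <- tribble(
  ~continent, ~total_pop,
  "Africa",   50,
  "Europe",   30,
  "Asia",     20
)

# Bar height is mapped directly to the precomputed total
p_bar <- ggplot(summary_toy, aes(x = continent, y = total_pop)) +
  geom_col() +
  labs(x = "Continent", y = "Total population")

p_bar
```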
- Your turn: Recreate the following plot, which is commonly referred to as a lollipop plot. Hint: Start with the points, then try adding the `segment`s, then add axis labels and `caption`, and finally, as a stretch goal, update the x scale (which will require a function we haven’t introduced in lectures or labs yet!).
# add code here
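The target plot is not reproduced in this text, so the following is only a guess at its general shape: a lollipop plot built from `geom_point()` plus `geom_segment()`, with population on the x-axis. The data, labels, caption text, and the choice of `label_comma()` for the x scale are all assumptions made for this sketch.

```r
library(tidyverse)
library(scales)

# Hypothetical continent totals; values are for illustration only
summary_toy <- tribble(
  ~continent, ~total_pop,
  "Africa",   50000,
  "Europe",   30000,
  "Asia",     20000
)

p_lollipop <- ggplot(
  summary_toy,
  aes(x = total_pop, y = fct_reorder(continent, total_pop))
) +
  geom_point() +                                   # the "candy" of the lollipop
  geom_segment(
    aes(x = 0, xend = total_pop,
        yend = fct_reorder(continent, total_pop))  # the "stick", from 0 to the point
  ) +
  scale_x_continuous(labels = label_comma()) +     # assumed choice of x-scale formatting
  labs(
    x = "Total population",
    y = "Continent",
    caption = "Illustrative data only"
  )

p_lollipop
```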
- Think out loud: What additional improvements would you like to make to this plot?
Add your response here.