Meet the toolkit

Lecture 1

2024-05-16

Warm-up

Course homepage

Let’s take a tour!

sta199-summer24.github.io/

Collaboration policy

  • Only work that is clearly assigned as team work should be completed collaboratively.

  • Labs must be completed individually. You may not directly share answers / code with others, however you are welcome to discuss the problems in general and ask for advice.

  • Exams must be completed individually. You may not discuss any aspect of the exam with peers. If you have questions, post as private questions on the course forum, only the teaching team will see and answer.

Sharing / reusing code policy

  • We are aware that a huge volume of code is available on the web, and many tasks may have solutions posted

  • Unless explicitly stated otherwise, this course’s policy is that you may make use of any online resources (e.g. RStudio Community, StackOverflow, etc.) but you must explicitly cite where you obtained any code you directly use or use as inspiration in your solution(s).

  • Any recycled code that is discovered and is not explicitly cited will be treated as plagiarism, regardless of source

Use of generative AI

  • Treat generative AI, such as ChatGPT, the same as other online resources.

  • Guiding principles:

    • (1) Cognitive dimension: Working with AI should not reduce your ability to think clearly. We will practice using AI to facilitate—rather than hinder—learning.

    • (2) Ethical dimension: Students using AI should be transparent about their use and make sure it aligns with academic integrity.

  • ✅ AI tools for code: You may make use of the technology for coding examples on assignments; if you do so, you must explicitly cite where you obtained the code. See the syllabus for guidelines for citing AI-generated content.

  • ❌ AI tools for narrative: Unless instructed otherwise, you may not use generative AI to write narrative on assignments. In general, you may use generative AI as a resource as you complete assignments but not to answer the exercises for you.

Academic integrity

To uphold the Duke Community Standard:

  • I will not lie, cheat, or steal in my academic endeavors;

  • I will conduct myself honorably in all my endeavors; and

  • I will act if the Standard is compromised.

Most importantly!

Ask if you’re not sure if something violates a policy!

Five tips for success

  1. Complete all the preparation work before class.

  2. Ask questions.

  3. Do the readings.

  4. Do the lab.

  5. Don’t procrastinate!

Course toolkit

Course toolkit

Course operation

Doing data science

  • Computing:
    • R
    • RStudio
    • tidyverse
    • Quarto
  • Version control and collaboration:
    • Git
    • GitHub

Toolkit: Computing

Learning goals

By the end of the course, you will be able to…

  • gain insight from data
  • gain insight from data, reproducibly
  • gain insight from data, reproducibly, using modern programming tools and techniques
  • gain insight from data, reproducibly and collaboratively, using modern programming tools and techniques
  • gain insight from data, reproducibly (with literate programming and version control) and collaboratively, using modern programming tools and techniques

Reproducible data analysis

Reproducibility checklist

What does it mean for a data analysis to be “reproducible”?

Short-term goals:

  • Are the tables and figures reproducible from the code and data?
  • Does the code actually do what you think it does?
  • In addition to what was done, is it clear why it was done?

Long-term goals:

  • Can the code be used for other data?
  • Can you extend the code to do other things?

Toolkit for reproducibility

  • Scriptability \(\rightarrow\) R
  • Literate programming (code, narrative, output in one place) \(\rightarrow\) Quarto
  • Version control \(\rightarrow\) Git / GitHub

R and RStudio

R and RStudio

R logo

  • R is an open-source statistical programming language
  • R is also an environment for statistical computing and graphics
  • It’s easily extensible with packages

RStudio logo

  • RStudio is a convenient interface for R called an IDE (integrated development environment), e.g. “I write R code in the RStudio IDE”
  • RStudio is not a requirement for programming with R, but it’s very commonly used by R programmers and data scientists

R vs. RStudio

On the left: a car engine. On the right: a car dashboard. The engine is labelled R. The dashboard is labelled RStudio.

R packages

  • Packages: Fundamental units of reproducible R code, including reusable R functions, the documentation that describes how to use them, and sample data1

  • As of 15 January 2023, there are 20,252 R packages available on CRAN (the Comprehensive R Archive Network)2

  • We’re going to work with a small (but important) subset of these!

Tour: R + RStudio

AE 01:

Sit back and enjoy the show!

To find the AE, go to the course GitHub organization and find ae-01-meet-the-penguins.

Tour recap: R + RStudio

A short list (for now) of R essentials

  • Functions are (most often) verbs, followed by what they will be applied to in parentheses:
  • Packages are installed with the install.packages() function and loaded with the library function, once per session:

R essentials (continued)

  • Columns (variables) in data frames are accessed with $:
  • Object documentation can be accessed with ?

tidyverse

Hex logos for dplyr, ggplot2, forcats, tibble, readr, stringr, tidyr, and purrr

tidyverse.org

  • The tidyverse is an opinionated collection of R packages designed for data science
  • All packages share an underlying philosophy and a common grammar

Quarto

Quarto

  • Fully reproducible reports – each time you render the analysis is ran from the beginning
  • Code goes in chunks narrative goes outside of chunks
  • A visual editor for a familiar / Google docs-like editing experience

Tour: Quarto

Option 1:

Sit back and enjoy the show!

To find the AE, go to the course GitHub organization and find ae-01-meet-the-penguins.

Tour recap: Quarto

RStudio IDE with a Quarto document, source code on the left and output on the right. Annotated to show the YAML, a link, a header, and a code chunk.

Environments

Important

The environment of your Quarto document is separate from the Console!

Remember this, and expect it to bite you a few times as you’re learning to work with Quarto!

How will we use Quarto?

  • Every application exercise, lab, project, etc. is an Quarto document
  • You’ll always have a template Quarto document to start with
  • The amount of scaffolding in the template will decrease over the semester