Data Analysis: Part 1

BAA-POCS Professional Development Series

Author

Dr. Maria Tackett

Published

December 4, 2025

Introductions

  • Name

  • School

  • Year in school

  • Description of your project

Topics

Today

  • Goal of data analysis
  • Identify types of study designs and data
  • Identify types of variables
  • Introduce R and RStudio
  • Visualize and summarize variables

Next time

  • Statistical inference
  • Relationships between variables
  • Coding in R
  • Other topics?

Goal of data analysis

“Information is what we want, but data are what we’ve got.” - Modern Data Science with R, Chapter 1

“Scientists seek to answer questions using rigorous methods and careful observations. These observations – collected from the likes of field notes, surveys, and experiments – form the backbone of a statistical investigation and are called data. Statistics is the study of how best to collect, analyze, and draw conclusions from data.” - Introduction to Modern Statistics, Chapter 1

Your turn!

How are you using data analysis in your research?

Data analysis workflow

Source: R for Data Science

Data set: North Carolina counties

We will use data about the 100 counties in North Carolina. The data were collected from Census QuickFacts and are available in the usdata R package. Let’s look at the first 10 rows of data.

Code
nc_counties |>
  slice(1:10) |>
  kable(digits = 3)

| name | state | pop2000 | pop2010 | pop2017 | pop_change | poverty | homeownership | multi_unit | unemployment_rate | metro | median_edu | per_capita_income | median_hh_income | smoking_ban |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alamance County | North Carolina | 130800 | 151131 | 162391 | 5.16 | 17.6 | 68.1 | 17.1 | 4.30 | yes | some_college | 25374.90 | 44281 | none |
| Alexander County | North Carolina | 33603 | 37198 | 37286 | 0.53 | 14.7 | 79.9 | 2.2 | 3.67 | yes | hs_diploma | 22385.82 | 44523 | none |
| Alleghany County | North Carolina | 10677 | 11155 | 11031 | 1.02 | 21.0 | 74.0 | 6.2 | 5.16 | no | hs_diploma | 21280.18 | 38944 | none |
| Anson County | North Carolina | 25275 | 26948 | 24991 | -3.79 | 22.7 | 71.0 | 4.9 | 5.31 | no | hs_diploma | 19798.37 | 38123 | none |
| Ashe County | North Carolina | 24384 | 27281 | 26957 | 0.25 | 19.4 | 79.2 | 4.4 | 4.18 | no | some_college | 24350.00 | 40293 | none |
| Avery County | North Carolina | 17167 | 17797 | 17536 | -0.39 | 14.7 | 72.8 | 18.1 | 4.35 | no | some_college | 26362.67 | 37109 | none |
| Beaufort County | North Carolina | 44958 | 47759 | 47088 | -0.64 | 19.1 | 73.4 | 9.2 | 5.13 | no | some_college | 23442.11 | 41101 | NA |
| Bertie County | North Carolina | 19773 | 21282 | 19224 | -5.53 | 22.0 | 76.9 | 2.2 | 6.08 | no | hs_diploma | 19123.28 | 31287 | none |
| Bladen County | North Carolina | 32278 | 35190 | 33478 | -3.53 | 24.5 | 69.0 | 5.5 | 5.97 | no | hs_diploma | 20570.82 | 32396 | none |
| Brunswick County | North Carolina | 73143 | 107431 | 130897 | 13.82 | 14.1 | 77.5 | 9.3 | 5.66 | yes | some_college | 29150.66 | 51164 | none |
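
The code above uses an object called nc_counties, but the step that creates it is not shown. A minimal sketch of how it could be built, assuming it is simply the North Carolina rows of the county data set in the usdata package (the filtering step is an assumption, not taken from the original materials):

```r
library(usdata)  # provides the `county` data set
library(dplyr)   # filter(), slice()
library(knitr)   # kable()

# Assumption: nc_counties is the North Carolina subset of usdata::county
nc_counties <- county |>
  filter(state == "North Carolina")

# Number of rows (one per county)
nrow(nc_counties)

# First 10 rows, formatted as a table
nc_counties |>
  slice(1:10) |>
  kable(digits = 3)
```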

Understanding the data

This is a data frame (like a spreadsheet). It is also an example of tidy data that is ready for analysis. In tidy data:

  1. Each row is an observation
  2. Each column is a variable (a characteristic of the observation)
  3. The table contains one type of observational unit

Your turn!
  1. What do the rows represent in the North Carolina data?
  2. What do the columns represent?

Study designs

It is important to understand data provenance (data origin and history of changes), because it helps us understand the scope of the conclusions that can be drawn from the data. See “Datasheets for Datasets” by Gebru et al. (2021) for more on data provenance and documentation.

A key piece of data provenance is how the data were collected, called the study design. There are two types of study designs: Experimental and Observational.

  • Experimental study: Researchers (randomly) assign subjects to specific treatments.

    • Subjects are generally similar across treatment groups.

    • Can make causal claims (e.g., treatment X causes outcome Y), because random assignment balances confounding factors across the groups on average. The only systematic difference between the groups is the treatment that is applied.

  • Observational study: Researchers do not assign subjects to treatment.

    • Subjects are likely different across treatment groups.

    • Challenging to make causal claims, because there could be confounding factors that affect subjects’ behavior.
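
The random assignment step in an experimental study can be sketched in a few lines of R. This is an illustration only: the 20 subjects, the two group labels, and the seed are hypothetical and do not come from a real study.

```r
set.seed(2025)  # hypothetical seed, so the random assignment is repeatable

# Hypothetical study with 20 subjects
subjects <- data.frame(id = 1:20)

# Shuffle 10 "treatment" and 10 "control" labels across the subjects
subjects$group <- sample(rep(c("treatment", "control"), each = 10))

# Check the group sizes
table(subjects$group)
```

Because chance alone decides who gets the treatment, characteristics such as age or prior health should be spread roughly evenly across the two groups.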

Below is a chart from Introduction to Modern Statistics (Chapter 2) showing how the scope of conclusions relates to the study design.

Source: Introduction to Modern Statistics

Your turn!
  • What type of study design was used to collect the North Carolina counties data?

  • Below is a graph of the relationship between population change from 2010 to 2017 and per capita (per person) income.

Code
ggplot(data = nc_counties, aes(x = pop_change, y = per_capita_income)) +
  geom_point() +
  labs(x = "Population change 2010 to 2017",
       y = "Per capita income")

  • TRUE or FALSE. More people moving to a county causes an increase in the income per person.

Types of variables

It’s important to know each variable’s type, because the type informs how we analyze the variable.

  • Numeric (quantitative)

    • Continuous (e.g., height in inches)

    • Discrete (e.g., number of siblings)

  • Categorical

    • Nominal (e.g., hair color)

    • Ordinal (e.g., Freshman, Sophomore, Junior, Senior)

  • Identifier (e.g., Student ID number)
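
One way to see how these types map onto R is to build a small data frame with one column of each type. The sketch below uses made-up student data that mirrors the examples in the list; the column names and values are hypothetical.

```r
# Hypothetical student data, one column per variable type
students <- data.frame(
  student_id = c("S001", "S002", "S003"),           # identifier
  height_in  = c(64.5, 70.2, 67.0),                 # numeric, continuous
  n_siblings = c(0L, 2L, 1L),                       # numeric, discrete
  hair_color = factor(c("brown", "black", "red")),  # categorical, nominal
  class_year = factor(
    c("Junior", "Freshman", "Senior"),
    levels = c("Freshman", "Sophomore", "Junior", "Senior"),
    ordered = TRUE                                  # categorical, ordinal
  )
)

# How R stores each column
sapply(students, function(x) class(x)[1])

# Ordered factors support comparisons; unordered factors do not
students$class_year < "Senior"
```

R does not know on its own that a column is ordinal or an identifier, so part of the analyst's job is to store each variable in a form that matches its type.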

NC counties: Types of variables

Let’s look at the first 10 rows of the NC counties data again:

Code
nc_counties |>
  slice(1:10) |>
  kable(digits = 3)

| name | state | pop2000 | pop2010 | pop2017 | pop_change | poverty | homeownership | multi_unit | unemployment_rate | metro | median_edu | per_capita_income | median_hh_income | smoking_ban |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alamance County | North Carolina | 130800 | 151131 | 162391 | 5.16 | 17.6 | 68.1 | 17.1 | 4.30 | yes | some_college | 25374.90 | 44281 | none |
| Alexander County | North Carolina | 33603 | 37198 | 37286 | 0.53 | 14.7 | 79.9 | 2.2 | 3.67 | yes | hs_diploma | 22385.82 | 44523 | none |
| Alleghany County | North Carolina | 10677 | 11155 | 11031 | 1.02 | 21.0 | 74.0 | 6.2 | 5.16 | no | hs_diploma | 21280.18 | 38944 | none |
| Anson County | North Carolina | 25275 | 26948 | 24991 | -3.79 | 22.7 | 71.0 | 4.9 | 5.31 | no | hs_diploma | 19798.37 | 38123 | none |
| Ashe County | North Carolina | 24384 | 27281 | 26957 | 0.25 | 19.4 | 79.2 | 4.4 | 4.18 | no | some_college | 24350.00 | 40293 | none |
| Avery County | North Carolina | 17167 | 17797 | 17536 | -0.39 | 14.7 | 72.8 | 18.1 | 4.35 | no | some_college | 26362.67 | 37109 | none |
| Beaufort County | North Carolina | 44958 | 47759 | 47088 | -0.64 | 19.1 | 73.4 | 9.2 | 5.13 | no | some_college | 23442.11 | 41101 | NA |
| Bertie County | North Carolina | 19773 | 21282 | 19224 | -5.53 | 22.0 | 76.9 | 2.2 | 6.08 | no | hs_diploma | 19123.28 | 31287 | none |
| Bladen County | North Carolina | 32278 | 35190 | 33478 | -3.53 | 24.5 | 69.0 | 5.5 | 5.97 | no | hs_diploma | 20570.82 | 32396 | none |
| Brunswick County | North Carolina | 73143 | 107431 | 130897 | 13.82 | 14.1 | 77.5 | 9.3 | 5.66 | yes | some_college | 29150.66 | 51164 | none |

Link to documentation: https://openintrostat.github.io/usdata/reference/county.html

Your turn!

Identify an example of each variable type in the NC counties data:

  • Continuous variable

  • Discrete variable

  • Nominal variable

  • Ordinal variable

  • Identifier variable

Describing distributions

Describing distributions of numeric variables

  • Shape

    • skewness: right-skewed (tail to the right), left-skewed (tail to the left), symmetric

    • modality: unimodal (one peak), bimodal (two peaks), multimodal (three or more peaks), uniform (no peaks)

  • Center: mean (average), median (50th percentile)

  • Spread: range (max - min), standard deviation (roughly, the typical distance of an observation from the mean), interquartile range (75th percentile - 25th percentile)

  • Unusual observations

We can use a histogram to visualize the distribution of a numeric variable. Below is a histogram of unemployment_rate.

Code
ggplot(data = nc_counties, aes(x = unemployment_rate)) +
  geom_histogram(color = "black", fill = "steelblue") +
  labs(x = "Unemployment rate")

Summary statistics for unemployment rate are below:

Code
nc_counties |>
  skim(unemployment_rate) |>
  select(numeric.mean:numeric.p100)
# A tibble: 1 × 7
  numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75
         <dbl>      <dbl>      <dbl>       <dbl>       <dbl>       <dbl>
1         5.02       1.03       3.49        4.30        4.72        5.58
# ℹ 1 more variable: numeric.p100 <dbl>
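
The measures of center and spread can also be computed one at a time, which makes the definitions concrete. For example, the interquartile range in the output above is 5.58 - 4.30 = 1.28. The sketch below uses dplyr's summarize() with base R functions; it assumes nc_counties is the North Carolina subset of usdata::county, and the output column names are hypothetical.

```r
library(usdata)
library(dplyr)

# Assumption: nc_counties is the North Carolina subset of usdata::county
nc_counties <- county |>
  filter(state == "North Carolina")

unemp_summary <- nc_counties |>
  summarize(
    mean_rate   = mean(unemployment_rate, na.rm = TRUE),    # center
    median_rate = median(unemployment_rate, na.rm = TRUE),  # center
    sd_rate     = sd(unemployment_rate, na.rm = TRUE),      # spread
    iqr_rate    = IQR(unemployment_rate, na.rm = TRUE),     # spread: p75 - p25
    range_rate  = max(unemployment_rate, na.rm = TRUE) -
                  min(unemployment_rate, na.rm = TRUE)      # spread: max - min
  )

unemp_summary
```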

Your turn!

Describe the distribution of unemployment rate:

  • shape
  • center
  • spread
  • unusual observations (if any)

Describing distributions of categorical variables

We describe the distribution of categorical variables using visualizations and a frequency table that contains the number and/or proportion of observations in each category.

Below are a bar chart and a frequency table showing the distribution of median_edu, the median education level (2013 - 2017):

Code
ggplot(data = nc_counties, aes(x = median_edu)) +
  geom_bar(color = "black", fill = "darkcyan") +
  labs(x = "Median education (2013 - 2017)")

# A tibble: 3 × 3
  median_edu       n proportion
  <fct>        <int>      <dbl>
1 hs_diploma      43       0.43
2 some_college    55       0.55
3 bachelors        2       0.02
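
The code for the frequency table is not shown above. One way such a table could be produced is with dplyr's count() and mutate(); this is a sketch under the assumption that nc_counties is the North Carolina subset of usdata::county, not necessarily the code used to make the table.

```r
library(usdata)
library(dplyr)

# Assumption: nc_counties is the North Carolina subset of usdata::county
nc_counties <- county |>
  filter(state == "North Carolina")

edu_freq <- nc_counties |>
  count(median_edu) |>                # number of counties in each category
  mutate(proportion = n / sum(n))     # share of all counties in each category

edu_freq
```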

Your turn!

Describe the distribution of median education level.

We will look at more visualizations and summary statistics in the next session.

Computing in R and RStudio [1]

Reproducibility

Your turn!

What does it mean for an analysis to be “reproducible”?

Near-term goals:

  • Are the tables and figures reproducible from the code and data?
  • Does the code actually do what you think it does?
  • In addition to what was done, is it clear why it was done?

Long-term goals:

  • Can the code be used for other data?
  • Can you extend the code to do other things?

R and RStudio

R:

  • R is an open-source statistical programming language
  • R is also an environment for statistical computing and graphics
  • It’s easily extensible with packages

RStudio:

  • RStudio is a convenient interface for R called an IDE (integrated development environment), e.g. “I write R code in the RStudio IDE”
  • RStudio is not a requirement for programming with R, but it’s very commonly used by R programmers and data scientists

R is like the engine of a car, and RStudio is like the dashboard and controls inside it.

Packages

  • Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data

  • As of September 2020, there were over 16,000 R packages available on CRAN (the Comprehensive R Archive Network)

  • We can do most data analysis tasks using the tidyverse and tidymodels packages.
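
A short sketch of the usual package workflow: check what is installed, install once, then load in each session. The install and load lines are commented out so the snippet runs even where the packages are not yet available.

```r
pkgs <- c("tidyverse", "tidymodels")

# Check which packages are already installed (TRUE/FALSE for each)
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
installed

# Install any that are missing (run once per computer; needs internet access)
# install.packages(pkgs[!installed])

# Load the packages at the start of each R session or Quarto document
# library(tidyverse)
# library(tidymodels)
```

Installing is a one-time step, while library() is needed every time a new R session starts.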

Accessing RStudio

  • Install R and RStudio on your computer (free): https://posit.co/download/rstudio-desktop/

  • Access RStudio online through Duke containers (free with Duke NetID):

    1. Reserve container:

      • Go to https://cmgr.oit.duke.edu/containers. You will log in using your NetID credentials.

      • Click “Reserve RStudio” to reserve an RStudio container.

      • You only need to reserve a container once per semester.

    2. Open RStudio container:

      • Go to https://cmgr.oit.duke.edu/containers and log in with your Duke NetID and password.

      • Click RStudio to log into the Docker container. You should now see the RStudio environment.

Tour of RStudio

  • Editor

  • Console

  • Environment

  • Files + Plots + Viewer

Quarto document (.qmd)

  • Fully reproducible reports – the analysis is run from the beginning each time you render

  • Code goes in chunks and narrative goes outside of chunks

  • A visual editor makes the editing experience similar to a word processor (Google Docs, Word, Pages, etc.)

  • Can produce multiple types of documents from the same Quarto file (e.g., websites, presentations, Word documents, academic publications, etc.)

Tour of Quarto document

Go to File -> New File -> Quarto Document

  • YAML

  • Text

  • Output

  • Rendered document
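
To make the parts of a Quarto document concrete, the sketch below writes a minimal .qmd file from R: a YAML header, a line of narrative text, and one code chunk. The title, file name, and chunk contents are hypothetical, and the rendering step is commented out because it assumes the Quarto command-line tool and the quarto R package are installed.

```r
# Build the chunk delimiter programmatically so this example can itself
# be shown inside a code block
fence <- strrep("`", 3)

qmd_lines <- c(
  "---",                                  # YAML header starts
  "title: \"Data analysis example\"",     # hypothetical title
  "format: html",
  "---",                                  # YAML header ends
  "",
  "Narrative text goes outside of code chunks.",
  "",
  paste0(fence, "{r}"),                   # code chunk starts
  "summary(cars)",                        # `cars` is a data set built into R
  fence                                   # code chunk ends
)

# Write the file to a temporary folder
qmd_path <- file.path(tempdir(), "example.qmd")
writeLines(qmd_lines, qmd_path)

# Render to HTML (assumption: Quarto and the quarto R package are installed)
# quarto::quarto_render(qmd_path)
```

In RStudio, the same result comes from creating the file through the menu and clicking Render.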

Additional resources

The content in this document is based on the resources listed below. These are great resources for more in-depth discussion of today’s topics and for additional practice.

Footnotes

  1. Content in this section is from datasciencebox.org.