Data Analysis: Part 1

BAA-POCS Professional Development Series

Author

Dr. Maria Tackett

Published

December 4, 2025

Introductions

Name
School
Year in school
Description of your project

Topics

Today

Goal of data analysis
Identify types of study designs and data
Identify types of variables
Introduce R and RStudio
Visualize and summarize variables

Next time

Statistical inference
Relationships between variables
Coding in R
Other topics?

Goal of data analysis

“Information is what we want, but data are what we’ve got.” - Modern Data Science with R, Chapter 1

“Scientists seek to answer questions using rigorous methods and careful observations. These observations – collected from the likes of field notes, surveys, and experiments – form the backbone of a statistical investigation and are called data. Statistics is the study of how best to collect, analyze, and draw conclusions from data.” - Introduction to Modern Statistics, Chapter 1

Your turn!

How are you using data analysis in your research?

Data analysis workflow

Data set: North Carolina counties

We will use data about the 100 counties in North Carolina. The data were collected from Census Quick Facts and are available in the usdata R package. Let’s look at the first 10 rows of data.

Code

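library(tidyverse)  # provides slice() and the ggplot2 functions used later
library(knitr)      # provides kable()

nc_counties |>
  slice(1:10) |>
  kable(digits = 3)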

name	state	pop2000	pop2010	pop2017	pop_change	poverty	homeownership	multi_unit	unemployment_rate	metro	median_edu	per_capita_income	median_hh_income	smoking_ban
Alamance County	North Carolina	130800	151131	162391	5.16	17.6	68.1	17.1	4.30	yes	some_college	25374.90	44281	none
Alexander County	North Carolina	33603	37198	37286	0.53	14.7	79.9	2.2	3.67	yes	hs_diploma	22385.82	44523	none
Alleghany County	North Carolina	10677	11155	11031	1.02	21.0	74.0	6.2	5.16	no	hs_diploma	21280.18	38944	none
Anson County	North Carolina	25275	26948	24991	-3.79	22.7	71.0	4.9	5.31	no	hs_diploma	19798.37	38123	none
Ashe County	North Carolina	24384	27281	26957	0.25	19.4	79.2	4.4	4.18	no	some_college	24350.00	40293	none
Avery County	North Carolina	17167	17797	17536	-0.39	14.7	72.8	18.1	4.35	no	some_college	26362.67	37109	none
Beaufort County	North Carolina	44958	47759	47088	-0.64	19.1	73.4	9.2	5.13	no	some_college	23442.11	41101	NA
Bertie County	North Carolina	19773	21282	19224	-5.53	22.0	76.9	2.2	6.08	no	hs_diploma	19123.28	31287	none
Bladen County	North Carolina	32278	35190	33478	-3.53	24.5	69.0	5.5	5.97	no	hs_diploma	20570.82	32396	none
Brunswick County	North Carolina	73143	107431	130897	13.82	14.1	77.5	9.3	5.66	yes	some_college	29150.66	51164	none
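
The code above assumes a data frame named nc_counties already exists. This handout does not show how it was created. The sketch below is one plausible way to build it, assuming nc_counties is simply the North Carolina subset of the county data set in the usdata package (the column names in the table match that data set).

```r
library(usdata)  # provides the county data set
library(dplyr)   # provides filter()

# Assumption: nc_counties is the North Carolina subset of usdata::county
nc_counties <- county |>
  filter(state == "North Carolina")

# Check the size of the result: there should be one row per county
nrow(nc_counties)
```

If the number of rows is not 100, the filter condition is the first thing to check.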

Understanding the data

This is a data frame (like a spreadsheet). It is also an example of tidy data that is ready for analysis. In tidy data

Each row is an observation
Each column is a variable (a characteristic of the observation)
The table contains one type of observational unit
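
A quick way to check the structure of a data frame in R is to ask for its dimensions and column names. The sketch below uses a small data frame called nc_sample (a name chosen here for illustration), built from three rows and three columns copied from the table above.

```r
# Three rows copied from the table above, with a subset of the columns
nc_sample <- data.frame(
  name    = c("Alamance County", "Alexander County", "Alleghany County"),
  pop2017 = c(162391, 37286, 11031),
  metro   = c("yes", "yes", "no")
)

nrow(nc_sample)   # number of rows (observations)
ncol(nc_sample)   # number of columns (variables)
names(nc_sample)  # variable names
```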

Your turn!

What do the rows represent in the North Carolina data?
What do the columns represent?

Study designs

It is important to understand data provenance (data origin and history of changes), because it helps us understand the scope of the conclusions that can be drawn from the data. See “Datasheets for Datasets” by Gebru et al. (2021) for more on data provenance and documentation.

A key piece of data provenance is how the data were collected, called the study design. There are two types of study designs: Experimental and Observational.

Experimental study: Researchers (randomly) assign subjects to specific treatments.
- Subjects are generally similar across treatment groups.
- Can make causal claims (e.g., Treatment X causes Y outcome), because the effect of confounding factors is reduced. The only difference between the groups is the treatment that is applied.
Observational study: Researchers do not assign subjects to treatment.
- Subjects are likely different across treatment groups
- Challenging to make causal claims, because there could be confounding factors that affect subjects’ behavior.
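
To make the idea of random assignment concrete, here is a minimal sketch in base R. The study, the subject IDs, and the group labels are all hypothetical. The point is only that chance, not the researcher or the subject, decides who gets the treatment.

```r
set.seed(2025)  # arbitrary seed so the assignment can be reproduced

# Hypothetical study with 10 subjects
subjects <- data.frame(id = 1:10)

# Shuffle five "treatment" and five "control" labels across the subjects
subjects$group <- sample(rep(c("treatment", "control"), each = 5))

# Each group ends up with five subjects
table(subjects$group)
```

In an observational study there is no step like sample(): the groups form on their own, which is why they may differ in ways other than the treatment.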

Below is a chart from Introduction to Modern Statistics (Chapter 2) showing how the scope of conclusions relates to the study design.

Source: Introduction to Modern Statistics

Your turn!

What type of study design was used to collect the North Carolina counties data?
Below is a graph of the relationship between population change from 2010 to 2017 and per capita (per person) income.

Code

ggplot(data = nc_counties, aes(x = pop_change, y = per_capita_income)) +
  geom_point() +
  labs(x = "Population change 2010 to 2017",
       y = "Per capita income")

TRUE or FALSE. More people moving to a county causes an increase in the income per person.

Types of variables

It’s important to know each variable’s type, because the type informs how we analyze the variable.

Numeric (quantitative)
- Continuous (e.g., height in inches)
- Discrete (e.g., number of siblings)
Categorical
- Nominal (e.g., hair color)
- Ordinal (e.g., Freshman, Sophomore, Junior, Senior)
Identifier (e.g., Student ID number)
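
The variable types above map roughly onto how columns are stored in R. The sketch below uses a made-up data frame of student records (all names and values are hypothetical) with one column per type.

```r
# Hypothetical student records, one column per variable type
students <- data.frame(
  student_id = c("S001", "S002", "S003"),             # identifier
  height_in  = c(64.5, 70.2, 67.0),                   # numeric, continuous
  siblings   = c(0L, 2L, 1L),                         # numeric, discrete
  hair_color = factor(c("brown", "black", "brown")),  # categorical, nominal
  class_year = factor(
    c("Junior", "Freshman", "Senior"),
    levels  = c("Freshman", "Sophomore", "Junior", "Senior"),
    ordered = TRUE                                    # categorical, ordinal
  )
)

# Show how R stores each column
str(students)

# Ordered factors support comparisons between categories
students$class_year < "Senior"
```

The ordering is what separates ordinal from nominal variables: asking whether one hair color is "less than" another has no meaningful answer.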

NC counties: Types of variables

Let’s look at the first 10 rows of the NC counties data again:

Code

nc_counties |>
  slice(1:10) |>
  kable(digits = 3)

name	state	pop2000	pop2010	pop2017	pop_change	poverty	homeownership	multi_unit	unemployment_rate	metro	median_edu	per_capita_income	median_hh_income	smoking_ban
Alamance County	North Carolina	130800	151131	162391	5.16	17.6	68.1	17.1	4.30	yes	some_college	25374.90	44281	none
Alexander County	North Carolina	33603	37198	37286	0.53	14.7	79.9	2.2	3.67	yes	hs_diploma	22385.82	44523	none
Alleghany County	North Carolina	10677	11155	11031	1.02	21.0	74.0	6.2	5.16	no	hs_diploma	21280.18	38944	none
Anson County	North Carolina	25275	26948	24991	-3.79	22.7	71.0	4.9	5.31	no	hs_diploma	19798.37	38123	none
Ashe County	North Carolina	24384	27281	26957	0.25	19.4	79.2	4.4	4.18	no	some_college	24350.00	40293	none
Avery County	North Carolina	17167	17797	17536	-0.39	14.7	72.8	18.1	4.35	no	some_college	26362.67	37109	none
Beaufort County	North Carolina	44958	47759	47088	-0.64	19.1	73.4	9.2	5.13	no	some_college	23442.11	41101	NA
Bertie County	North Carolina	19773	21282	19224	-5.53	22.0	76.9	2.2	6.08	no	hs_diploma	19123.28	31287	none
Bladen County	North Carolina	32278	35190	33478	-3.53	24.5	69.0	5.5	5.97	no	hs_diploma	20570.82	32396	none
Brunswick County	North Carolina	73143	107431	130897	13.82	14.1	77.5	9.3	5.66	yes	some_college	29150.66	51164	none

Link to documentation: https://openintrostat.github.io/usdata/reference/county.html

Your turn!

Identify an example of each variable type in the NC counties data:

Continuous variable
Discrete variable
Nominal variable
Ordinal variable
Identifier variable

Describing distributions

Describing distributions of numeric variables

Shape
- skewness: right-skewed (tail to the right), left-skewed (tail to the left), symmetric
- modality: unimodal (one peak), bimodal (two peaks), multimodal (three or more peaks), uniform (no peaks)
Center: mean (average), median (50th percentile)
Spread: range (max - min), standard deviation (roughly the typical distance of observations from the mean), inter-quartile range (75th percentile - 25th percentile)
Unusual observations
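
The measures of center and spread listed above can each be computed with a base R function. The sketch below uses only the ten unemployment rates visible in the table shown earlier, not the full data set, so the values will differ from those for all 100 counties.

```r
# Unemployment rates for the first 10 counties in the table shown earlier
unemployment <- c(4.30, 3.67, 5.16, 5.31, 4.18, 4.35, 5.13, 6.08, 5.97, 5.66)

# Center
mean(unemployment)
median(unemployment)

# Spread
max(unemployment) - min(unemployment)  # range
sd(unemployment)                       # standard deviation
IQR(unemployment)                      # inter-quartile range
```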

We can use a histogram to visualize the distribution of a numeric variable. Below is a histogram of unemployment_rate.

Code

ggplot(data = nc_counties, aes(x = unemployment_rate)) +
  geom_histogram(color = "black", fill = "steelblue") +
  labs(x = "Unemployment rate")

Summary statistics for unemployment rate are below:

Code

library(skimr)  # provides skim()

nc_counties |>
  skim(unemployment_rate) |>
  select(numeric.mean:numeric.p100)

# A tibble: 1 × 7
  numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75
         <dbl>      <dbl>      <dbl>       <dbl>       <dbl>       <dbl>
1         5.02       1.03       3.49        4.30        4.72        5.58
# ℹ 1 more variable: numeric.p100 <dbl>

Your turn!

Describe the distribution of unemployment rate:

shape
center
spread
unusual observations (if any)

Describing distributions of categorical variables

We describe the distribution of categorical variables using visualizations and a frequency table that contains the number and/or proportion of observations in each category.

Below is a bar chart and frequency table showing the distribution of median_edu, the median education level (2013 - 2017):

Code

ggplot(data = nc_counties, aes(x = median_edu)) +
  geom_bar(color = "black", fill = "darkcyan") +
  labs(x = "Median education (2013 - 2017)")

# A tibble: 3 × 3
  median_edu       n proportion
  <fct>        <int>      <dbl>
1 hs_diploma      43       0.43
2 some_college    55       0.55
3 bachelors        2       0.02
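
The code that produced the frequency table is not shown above. One common way to build a table of this form with dplyr is count() followed by mutate(). The sketch below is not necessarily the code behind the table above. It uses a small data frame called nc_sample (a name chosen here for illustration) holding median_edu for only the ten counties listed in the earlier table, and the factor levels are assumed from the categories shown in the output.

```r
library(dplyr)

# median_edu for the first 10 counties in the table shown earlier
nc_sample <- data.frame(
  median_edu = factor(
    c("some_college", "hs_diploma", "hs_diploma", "hs_diploma", "some_college",
      "some_college", "some_college", "hs_diploma", "hs_diploma", "some_college"),
    levels = c("hs_diploma", "some_college", "bachelors")  # assumed levels
  )
)

# Count the observations in each category, then convert counts to proportions
edu_table <- nc_sample |>
  count(median_edu) |>
  mutate(proportion = n / sum(n))

edu_table
```

Running the same pipeline on the full nc_counties data frame should give a table with the same columns as the one above.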

Your turn!

Describe the distribution of median education level.

We will look at more visualizations and summary statistics in the next session.

Computing in R and RStudio¹

Reproducibility

Your turn!

What does it mean for an analysis to be “reproducible”?

Near-term goals

Are the tables and figures reproducible from the code and data?
Does the code actually do what you think it does?
In addition to what was done, is it clear why it was done?

Long-term goals:

Can the code be used for other data?
Can you extend the code to do other things?

R and RStudio

R is an open-source statistical programming language
R is also an environment for statistical computing and graphics
It’s easily extensible with packages

RStudio:

RStudio is a convenient interface for R called an IDE (integrated development environment), e.g. “I write R code in the RStudio IDE”
RStudio is not a requirement for programming with R, but it’s very commonly used by R programmers and data scientists

R is like the engine of a car and RStudio is like the dashboard.

Packages

Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data
As of September 2020, there were over 16,000 R packages available on CRAN (the Comprehensive R Archive Network)
We can do most data analysis tasks using the tidyverse and tidymodels packages.
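
A package needs to be installed once per computer and then loaded in each R session that uses it. The sketch below checks which of the two packages named above are already installed. The object names pkgs and installed are chosen here for illustration, and the install and load steps are left as comments because they depend on your setup.

```r
# Packages mentioned above
pkgs <- c("tidyverse", "tidymodels")

# TRUE for each package that is already installed on this computer
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
installed

# Install any that are missing (run once; requires an internet connection)
# install.packages(pkgs[!installed])

# Load the packages at the start of each session or Quarto document
# library(tidyverse)
# library(tidymodels)
```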

Accessing RStudio

Install R and RStudio on your computer (free): https://posit.co/download/rstudio-desktop/
Access RStudio online through Duke containers (free with Duke NetID):
1. Reserve container:
  - Go to https://cmgr.oit.duke.edu/containers. You will log in using your NetID credentials.
  - Click “Reserve RStudio” to reserve an RStudio container.
  - You only need to reserve a container once per semester.
2. Open RStudio container:
  - Go to https://cmgr.oit.duke.edu/containers and log in with your Duke NetID and password.
  - Click RStudio to log into the Docker container. You should now see the RStudio environment.

Tour of RStudio

Editor
Console
Environment
Files + Plots + Viewer

Quarto document (.qmd)

Fully reproducible reports – the analysis is run from the beginning each time you render
Code goes in chunks and narrative goes outside of chunks
Visual editor that makes editing a document feel similar to using a word processor (Google Docs, Word, Pages, etc.)
Can produce multiple types of documents from the same Quarto file (e.g., websites, presentations, Word documents, academic publications, etc.)

Tour of Quarto document

Go to File -> New File -> Quarto Document

YAML
Text
Output
Rendered document

Additional resources

The content in this document is based on the resources listed below. These are great resources for more in-depth discussion of today’s topics and for additional practice.

Introduction to Modern Statistics by Mine Çetinkaya-Rundel and Jo Hardin
Modern Data Science with R by Benjamin S. Baumer, Daniel T. Kaplan, and Nicholas J. Horton
R for Data Science by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund

Footnotes

¹ Content in this section is from datasciencebox.org.