2  Data Munging 1

Author

R Users Group

Published

September 25, 2024

2.1 Review

  • Console/Environment/Script
  • Comment your code with # and read your error messages

2.2 Assignment

<- is the assignment operator. An object created on the right side of the assignment operator is assigned to a name on the left side. Assignment is how we save the results of operations and functions. Without assignment, the result of a calculation is typically printed to the console but not saved for use in future calculations.

a <- 1
b <- 2

c <- a + b

c
[1] 3
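
A minimal sketch of the difference between printing a result and saving it. The names x and y are hypothetical and are not part of the workshop script.

```r
# x and y are hypothetical names used only for illustration
x <- 10

# Without assignment, the result is printed to the console and then discarded
x * 2

# x itself is unchanged by the line above
x

# With assignment, the result is saved under a new name for later use
y <- x * 2
y
```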

2.3 Tidy Data

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. ~ tidyverse.org

library(tidyverse) contains:

  • ggplot2, for data visualization.
  • dplyr, for data manipulation.
  • tidyr, for data tidying.
  • readr, for data import.
  • purrr, for functional programming.
  • tibble, for tibbles, a modern re-imagining of data frames.
  • stringr, for strings.
  • forcats, for factors.

2.3.1 Tidy data

The defining opinion of the tidyverse is its wholehearted adoption of tidy data. Tidy data has three features:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a dataframe. (This is from the paper, not the book)

Source: R for data science

Tidy datasets are all alike, but every messy dataset is messy in its own way. ~ Hadley Wickham

The tidy approach to data science is powerful because it breaks data work into two distinct parts. First, get the data into a tidy format. Second, use tools optimized for tidy data. Because most community-created tools assume this one standard data structure, the framework gave direction to otherwise scattered development and reduced the friction of data work.
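
As a rough illustration of the first part, the sketch below reshapes a small untidy table with pivot_longer() from tidyr. The table messy and all of its values are made up for this example.

```r
library(tidyr)
library(tibble)

# Hypothetical untidy data: the variable "year" is spread across column names
messy <- tibble(
  state = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(11, 21)
)

# Get the data into a tidy format: one row per state-year observation
tidy <- pivot_longer(
  data = messy,
  cols = c(`2019`, `2020`),
  names_to = "year",
  values_to = "value"
)

tidy
```

After reshaping, year is an ordinary column, so tools that expect one variable per column can use it directly.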

2.4 Exercise 0: Creating a Project and Loading Packages

If you are using a different computer or didn’t attend sessions 0 or 1, follow steps 1 and 2. If not, skip to step 3.

Step 1: Open RStudio. File > New Project > New Directory > Select the location where you would like to create a new folder that houses your R Project. Call it urbn101.

Step 2: Open an .R script with the button in the top left (sheet with a plus sign icon). Save the script as 02_data-munging1.R.

Step 3: If you have not previously installed library(tidyverse), submit install.packages("tidyverse") to the Console (type it and press Enter).

We’ll focus on the key dplyr syntax using the March 2020 Annual Social and Economic Supplement (ASEC) to the Current Population Survey (CPS).

Step 4: Add and run the following code to load ASEC data.

library(tidyverse)

asec <- read_csv(
  paste0(
    "https://raw.githubusercontent.com/awunderground/awunderground-data/",
    "main/cps/cps-asec.csv"
  )
)

We can use glimpse(asec) to quickly view the data. We can also use View(asec) to open up asec in RStudio.


glimpse(x = asec)
Rows: 157,959
Columns: 17
$ year       <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020,…
$ serial     <dbl> 1, 1, 2, 2, 3, 4, 4, 5, 5, 5, 5, 7, 8, 9, 10, 10, 10, 12, 1…
$ month      <chr> "March", "March", "March", "March", "March", "March", "Marc…
$ cpsid      <dbl> 2.01903e+13, 2.01903e+13, 2.01812e+13, 2.01812e+13, 2.01902…
$ asecflag   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ asecwth    <dbl> 1552.90, 1552.90, 990.49, 990.49, 1505.27, 1430.70, 1430.70…
$ pernum     <dbl> 1, 2, 1, 2, 1, 1, 2, 1, 2, 3, 4, 1, 1, 1, 1, 2, 3, 1, 2, 3,…
$ cpsidp     <dbl> 2.01903e+13, 2.01903e+13, 2.01812e+13, 2.01812e+13, 2.01902…
$ asecwt     <dbl> 1552.90, 1552.90, 990.49, 990.49, 1505.27, 1430.70, 1196.57…
$ ftype      <chr> "Primary family", "Primary family", "Primary family", "Prim…
$ ftotval    <dbl> 127449, 127449, 64680, 64680, 40002, 8424, 8424, 59114, 591…
$ inctot     <dbl> 52500, 74949, 44000, 20680, 40002, 0, 8424, 610, 58001, 503…
$ incwage    <dbl> 52500, 56000, 34000, 0, 40000, 0, 8424, 0, 58000, 0, 0, 0, …
$ offpov     <chr> "Above Poverty Line", "Above Poverty Line", "Above Poverty …
$ offpovuniv <chr> "In Poverty Universe", "In Poverty Universe", "In Poverty U…
$ offtotval  <dbl> 127449, 127449, 64680, 64680, 40002, 8424, 8424, 59114, 591…
$ offcutoff  <dbl> 17120, 17120, 17120, 17120, 13300, 15453, 15453, 26370, 263…


We’re going to learn seven functions and one new piece of syntax from library(dplyr) that will be our main tools for manipulating tidy data frames. These functions and a few extensions outlined in the Data Transformation Cheat Sheet are the core of data analysis in the Tidyverse.

2.5 1. select()

select() drops columns from a dataframe and/or reorders the columns in a dataframe. The arguments after the name of the dataframe should be the names of columns you wish to keep, without quotes. All other columns not listed are dropped.


select(.data = asec, year, month, serial)
# A tibble: 157,959 × 3
    year month serial
   <dbl> <chr>  <dbl>
 1  2020 March      1
 2  2020 March      1
 3  2020 March      2
 4  2020 March      2
 5  2020 March      3
 6  2020 March      4
 7  2020 March      4
 8  2020 March      5
 9  2020 March      5
10  2020 March      5
# ℹ 157,949 more rows


This works great until the goal is to select 99 of 100 variables. Fortunately, - can be used to remove variables. To remove several variables, list each one with a leading -, separated by commas.


select(.data = asec, -asecflag)
# A tibble: 157,959 × 16
    year serial month   cpsid asecwth pernum  cpsidp asecwt ftype ftotval inctot
   <dbl>  <dbl> <chr>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl> <chr>   <dbl>  <dbl>
 1  2020      1 March 2.02e13   1553.      1 2.02e13  1553. Prim…  127449  52500
 2  2020      1 March 2.02e13   1553.      2 2.02e13  1553. Prim…  127449  74949
 3  2020      2 March 2.02e13    990.      1 2.02e13   990. Prim…   64680  44000
 4  2020      2 March 2.02e13    990.      2 2.02e13   990. Prim…   64680  20680
 5  2020      3 March 2.02e13   1505.      1 2.02e13  1505. Nonf…   40002  40002
 6  2020      4 March 2.02e13   1431.      1 2.02e13  1431. Prim…    8424      0
 7  2020      4 March 2.02e13   1431.      2 2.02e13  1197. Prim…    8424   8424
 8  2020      5 March 2.02e13   1133.      1 2.02e13  1133. Prim…   59114    610
 9  2020      5 March 2.02e13   1133.      2 2.02e13  1133. Prim…   59114  58001
10  2020      5 March 2.02e13   1133.      3 2.02e13  1322. Prim…   59114    503
# ℹ 157,949 more rows
# ℹ 5 more variables: incwage <dbl>, offpov <chr>, offpovuniv <chr>,
#   offtotval <dbl>, offcutoff <dbl>
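
Dropping several columns at once can be sketched as follows. The tibble toy is a hypothetical miniature stand-in for asec with made-up values, used so the example runs without downloading the data.

```r
library(dplyr)

# Hypothetical miniature stand-in for asec with made-up values
toy <- tibble(
  year = c(2020, 2020),
  month = c("March", "March"),
  asecflag = c(1, 1),
  pernum = c(1, 2)
)

# Drop two columns by prefixing each with -
dropped <- select(.data = toy, -asecflag, -month)

# Equivalent: negate a vector of column names
dropped_alt <- select(.data = toy, -c(asecflag, month))

names(dropped)
```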


2.5.1 Exercise 1

  1. Select pernum and inctot from asec.
select(.data = asec, pernum, inctot)

\[\cdots\]

2.6 2. rename()

rename() renames columns in a data frame. The pattern is new_name = old_name.


rename(.data = asec, serial_number = serial)
# A tibble: 157,959 × 17
    year serial_number month   cpsid asecflag asecwth pernum  cpsidp asecwt
   <dbl>         <dbl> <chr>   <dbl>    <dbl>   <dbl>  <dbl>   <dbl>  <dbl>
 1  2020             1 March 2.02e13        1   1553.      1 2.02e13  1553.
 2  2020             1 March 2.02e13        1   1553.      2 2.02e13  1553.
 3  2020             2 March 2.02e13        1    990.      1 2.02e13   990.
 4  2020             2 March 2.02e13        1    990.      2 2.02e13   990.
 5  2020             3 March 2.02e13        1   1505.      1 2.02e13  1505.
 6  2020             4 March 2.02e13        1   1431.      1 2.02e13  1431.
 7  2020             4 March 2.02e13        1   1431.      2 2.02e13  1197.
 8  2020             5 March 2.02e13        1   1133.      1 2.02e13  1133.
 9  2020             5 March 2.02e13        1   1133.      2 2.02e13  1133.
10  2020             5 March 2.02e13        1   1133.      3 2.02e13  1322.
# ℹ 157,949 more rows
# ℹ 8 more variables: ftype <chr>, ftotval <dbl>, inctot <dbl>, incwage <dbl>,
#   offpov <chr>, offpovuniv <chr>, offtotval <dbl>, offcutoff <dbl>


You can also rename a selection of variables using rename_with(). The .cols argument is used to select the columns to rename and takes a tidyselect statement like those we introduced above. Here, we’re using the where() selection helper which selects all columns where a given condition is TRUE. The default value for the .cols argument is everything() which selects all columns in the dataset.


rename_with(.data = asec, .fn = toupper, .cols = where(is.numeric))
# A tibble: 157,959 × 17
    YEAR SERIAL month   CPSID ASECFLAG ASECWTH PERNUM  CPSIDP ASECWT ftype      
   <dbl>  <dbl> <chr>   <dbl>    <dbl>   <dbl>  <dbl>   <dbl>  <dbl> <chr>      
 1  2020      1 March 2.02e13        1   1553.      1 2.02e13  1553. Primary fa…
 2  2020      1 March 2.02e13        1   1553.      2 2.02e13  1553. Primary fa…
 3  2020      2 March 2.02e13        1    990.      1 2.02e13   990. Primary fa…
 4  2020      2 March 2.02e13        1    990.      2 2.02e13   990. Primary fa…
 5  2020      3 March 2.02e13        1   1505.      1 2.02e13  1505. Nonfamily …
 6  2020      4 March 2.02e13        1   1431.      1 2.02e13  1431. Primary fa…
 7  2020      4 March 2.02e13        1   1431.      2 2.02e13  1197. Primary fa…
 8  2020      5 March 2.02e13        1   1133.      1 2.02e13  1133. Primary fa…
 9  2020      5 March 2.02e13        1   1133.      2 2.02e13  1133. Primary fa…
10  2020      5 March 2.02e13        1   1133.      3 2.02e13  1322. Primary fa…
# ℹ 157,949 more rows
# ℹ 7 more variables: FTOTVAL <dbl>, INCTOT <dbl>, INCWAGE <dbl>, offpov <chr>,
#   offpovuniv <chr>, OFFTOTVAL <dbl>, OFFCUTOFF <dbl>
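
A short sketch of the default .cols behavior and of selection helpers, again on a hypothetical tibble named toy with made-up values. starts_with() is another tidyselect helper; it is used here only as an illustration.

```r
library(dplyr)

# Hypothetical miniature stand-in for asec with made-up values
toy <- tibble(
  year = c(2020, 2020),
  month = c("March", "March"),
  inctot = c(52500, 74949)
)

# With .cols left at its default, everything(), every column is renamed
all_upper <- rename_with(.data = toy, .fn = toupper)

# starts_with() restricts the renaming to columns with a given name prefix
inc_upper <- rename_with(.data = toy, .fn = toupper, .cols = starts_with("inc"))

# Selection helpers also work in select(): keep only the character columns
chr_only <- select(.data = toy, where(is.character))

names(all_upper)
names(inc_upper)
names(chr_only)
```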


Most dplyr functions can rename columns simply by prefacing the operation with new_name =. For example, this can be done with select():


select(.data = asec, year, month, serial_number = serial)
# A tibble: 157,959 × 3
    year month serial_number
   <dbl> <chr>         <dbl>
 1  2020 March             1
 2  2020 March             1
 3  2020 March             2
 4  2020 March             2
 5  2020 March             3
 6  2020 March             4
 7  2020 March             4
 8  2020 March             5
 9  2020 March             5
10  2020 March             5
# ℹ 157,949 more rows


2.7 3. filter()

filter() reduces the number of observations in a dataframe. Every column in a dataframe has a name. Rows do not necessarily have names in a dataframe, so rows need to be filtered based on logical conditions.

==, <, >, <=, >=, !=, and %in% are operators, and is.na() is a function, that can be used to build logical conditions. ! negates a condition, and & and | combine conditions: & means and, | means or.


# return rows with pernum of 1 and incwage > $100,000
filter(.data = asec, pernum == 1 & incwage > 100000)
# A tibble: 5,551 × 17
    year serial month   cpsid asecflag asecwth pernum  cpsidp asecwt ftype      
   <dbl>  <dbl> <chr>   <dbl>    <dbl>   <dbl>  <dbl>   <dbl>  <dbl> <chr>      
 1  2020     28 March 2.02e13        1    678.      1 2.02e13   678. Primary fa…
 2  2020    134 March 0              1    923.      1 0         923. Primary fa…
 3  2020    136 March 2.02e13        1    906.      1 2.02e13   906. Primary fa…
 4  2020    137 March 2.02e13        1   1493.      1 2.02e13  1493. Nonfamily …
 5  2020    359 March 2.02e13        1    863.      1 2.02e13   863. Primary fa…
 6  2020    372 March 2.02e13        1   1338.      1 2.02e13  1338. Primary fa…
 7  2020    404 March 0              1    677.      1 0         677. Primary fa…
 8  2020    420 March 2.02e13        1    747.      1 2.02e13   747. Primary fa…
 9  2020    450 March 2.02e13        1   1309.      1 2.02e13  1309. Primary fa…
10  2020    491 March 0              1   1130.      1 0        1130. Primary fa…
# ℹ 5,541 more rows
# ℹ 7 more variables: ftotval <dbl>, inctot <dbl>, incwage <dbl>, offpov <chr>,
#   offpovuniv <chr>, offtotval <dbl>, offcutoff <dbl>
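
The remaining operators can be sketched on a small hypothetical tibble. The values in toy are made up, and the missing value is included only to show how filter() treats NA.

```r
library(dplyr)

# Hypothetical miniature stand-in for asec with made-up values
toy <- tibble(
  pernum = c(1, 2, 3, 1),
  month = c("March", "March", "April", "May"),
  inctot = c(52500, NA, 610, 999999999)
)

# %in% keeps rows whose value matches any element of a vector
in_spring <- filter(.data = toy, month %in% c("March", "April"))

# ! negates a condition: keep rows where inctot is not missing
not_missing <- filter(.data = toy, !is.na(inctot))

# | keeps rows where either condition is TRUE
# filter() drops rows where the combined condition evaluates to NA
either <- filter(.data = toy, pernum == 1 | inctot < 1000)

in_spring
not_missing
either
```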

IPUMS CPS contains full documentation with information about pernum and incwage.


2.7.1 Exercise 2

  1. Filter asec to rows with month equal to "March".
  2. Filter asec to rows with inctot less than 999999999.
  3. Filter asec to rows with pernum equal to 3 and inctot less than 999999999.
filter(asec, month == "March")

filter(asec, inctot < 999999999)

filter(asec, pernum == 3, inctot < 999999999)

2.8 4. arrange()

arrange() sorts the rows of a data frame in alpha-numeric order based on the values of a variable or variables. The dataframe is sorted by the first variable first and each subsequent variable is used to break ties. desc() is used to reverse the sort order for a given variable.


# sort pernum in descending order because high pernums are interesting
arrange(.data = asec, desc(pernum))
# A tibble: 157,959 × 17
    year serial month   cpsid asecflag asecwth pernum  cpsidp asecwt ftype      
   <dbl>  <dbl> <chr>   <dbl>    <dbl>   <dbl>  <dbl>   <dbl>  <dbl> <chr>      
 1  2020  91430 March 0              1    505.     16 0         604. Secondary …
 2  2020  91430 March 0              1    505.     15 0         465. Secondary …
 3  2020  91430 March 0              1    505.     14 0         416. Secondary …
 4  2020  15037 March 2.02e13        1   2272.     13 2.02e13  2633. Primary fa…
 5  2020  78495 March 0              1   1279.     13 0        1424. Related su…
 6  2020  91430 March 0              1    505.     13 0         465. Secondary …
 7  2020  15037 March 2.02e13        1   2272.     12 2.02e13  1689. Primary fa…
 8  2020  18102 March 0              1   2468.     12 0        2871. Primary fa…
 9  2020  22282 March 0              1   2801.     12 0        3879. Related su…
10  2020  30274 March 2.02e13        1    653.     12 2.02e13   858. Primary fa…
# ℹ 157,949 more rows
# ℹ 7 more variables: ftotval <dbl>, inctot <dbl>, incwage <dbl>, offpov <chr>,
#   offpovuniv <chr>, offtotval <dbl>, offcutoff <dbl>
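
Tie-breaking with a second variable can be sketched on a hypothetical tibble with made-up values:

```r
library(dplyr)

# Hypothetical data with ties in pernum
toy <- tibble(
  pernum = c(1, 2, 2, 1),
  inctot = c(500, 300, 100, 200)
)

# Sort by pernum from high to low; within equal pernum, sort inctot low to high
sorted <- arrange(.data = toy, desc(pernum), inctot)

sorted
```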


2.8.1 Exercise 3

  1. Sort asec in descending order by pernum and ascending order by inctot.
arrange(asec, desc(pernum), inctot)

2.9 5. mutate()

mutate() creates new variables or edits existing variables. We can use arithmetic operators, such as +, -, *, /, and ^. We can also use custom functions and functions from packages. For example, we can use library(stringr) for string manipulation and library(lubridate) for date manipulation.


Variables are created by adding a new column name, like inctot_adjusted, to the left of = in mutate().

# adjust inctot for underreporting
mutate(.data = asec, inctot_adjusted = inctot * 1.1)
# A tibble: 157,959 × 18
    year serial month   cpsid asecflag asecwth pernum  cpsidp asecwt ftype      
   <dbl>  <dbl> <chr>   <dbl>    <dbl>   <dbl>  <dbl>   <dbl>  <dbl> <chr>      
 1  2020      1 March 2.02e13        1   1553.      1 2.02e13  1553. Primary fa…
 2  2020      1 March 2.02e13        1   1553.      2 2.02e13  1553. Primary fa…
 3  2020      2 March 2.02e13        1    990.      1 2.02e13   990. Primary fa…
 4  2020      2 March 2.02e13        1    990.      2 2.02e13   990. Primary fa…
 5  2020      3 March 2.02e13        1   1505.      1 2.02e13  1505. Nonfamily …
 6  2020      4 March 2.02e13        1   1431.      1 2.02e13  1431. Primary fa…
 7  2020      4 March 2.02e13        1   1431.      2 2.02e13  1197. Primary fa…
 8  2020      5 March 2.02e13        1   1133.      1 2.02e13  1133. Primary fa…
 9  2020      5 March 2.02e13        1   1133.      2 2.02e13  1133. Primary fa…
10  2020      5 March 2.02e13        1   1133.      3 2.02e13  1322. Primary fa…
# ℹ 157,949 more rows
# ℹ 8 more variables: ftotval <dbl>, inctot <dbl>, incwage <dbl>, offpov <chr>,
#   offpovuniv <chr>, offtotval <dbl>, offcutoff <dbl>, inctot_adjusted <dbl>


Variables are edited by including an existing column name, like inctot, to the left of = in mutate().

# adjust income because of underreporting
mutate(.data = asec, inctot = inctot * 1.1)
# A tibble: 157,959 × 17
    year serial month   cpsid asecflag asecwth pernum  cpsidp asecwt ftype      
   <dbl>  <dbl> <chr>   <dbl>    <dbl>   <dbl>  <dbl>   <dbl>  <dbl> <chr>      
 1  2020      1 March 2.02e13        1   1553.      1 2.02e13  1553. Primary fa…
 2  2020      1 March 2.02e13        1   1553.      2 2.02e13  1553. Primary fa…
 3  2020      2 March 2.02e13        1    990.      1 2.02e13   990. Primary fa…
 4  2020      2 March 2.02e13        1    990.      2 2.02e13   990. Primary fa…
 5  2020      3 March 2.02e13        1   1505.      1 2.02e13  1505. Nonfamily …
 6  2020      4 March 2.02e13        1   1431.      1 2.02e13  1431. Primary fa…
 7  2020      4 March 2.02e13        1   1431.      2 2.02e13  1197. Primary fa…
 8  2020      5 March 2.02e13        1   1133.      1 2.02e13  1133. Primary fa…
 9  2020      5 March 2.02e13        1   1133.      2 2.02e13  1133. Primary fa…
10  2020      5 March 2.02e13        1   1133.      3 2.02e13  1322. Primary fa…
# ℹ 157,949 more rows
# ℹ 7 more variables: ftotval <dbl>, inctot <dbl>, incwage <dbl>, offpov <chr>,
#   offpovuniv <chr>, offtotval <dbl>, offcutoff <dbl>


Conditional logic inside of mutate() with functions like if_else() and case_when() is key to mastering data munging in R.
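
case_when() handles more than two outcomes. The sketch below is illustrative only: the tibble toy, the variable income_group, and the cutoffs of 25,000 and 100,000 are all hypothetical.

```r
library(dplyr)

# Hypothetical data with made-up incomes
toy <- tibble(inctot = c(0, 40002, 127449))

# Conditions are checked in order; the first TRUE condition sets the value
# The final TRUE ~ ... line acts as the catch-all case
grouped_income <- mutate(
  .data = toy,
  income_group = case_when(
    inctot < 25000 ~ "low",
    inctot < 100000 ~ "middle",
    TRUE ~ "high"
  )
)

grouped_income
```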

2.9.1 Exercise 4

  1. Create a new variable called in_poverty. If offtotval is less than offcutoff then use "Below Poverty Line". Otherwise, use "Above Poverty Line". Hint: if_else() is useful and works like the IF command in Microsoft Excel.
mutate(
  asec,
  in_poverty = if_else(
    condition = offtotval < offcutoff, 
    true = "Below Poverty Line", 
    false = "Above Poverty Line"
  )
)

2.10 %>%

Data munging is tiring when each operation needs to be assigned to a name with <-. The pipe, %>%, allows lines of code to be chained together so the assignment operator only needs to be used once.

Consider this fake code example from Hadley Wickham:

I %>% 
  tumble(out_of = "bed") %>% 
  stumble(to = "the kitchen") %>% 
  pour(who = "myself", unit = "cup", what = "ambition") %>% 
  yawn() %>% 
  stretch() %>% 
  try(come_to_life())


%>% passes the output from one function as the first argument of the next function. For example, this line can be rewritten:


# old way
mutate(.data = asec, inctot_adjusted = inctot * 1.1)

# new way
asec %>%
  mutate(inctot_adjusted = inctot * 1.1)
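
With more than one step, the alternative to the pipe is nesting function calls, which must be read from the inside out. A minimal comparison on a hypothetical tibble with made-up values:

```r
library(dplyr)

# Hypothetical miniature stand-in for asec
toy <- tibble(
  pernum = c(1, 2, 1),
  inctot = c(52500, 74949, 44000)
)

# Nested calls are read from the inside out
nested <- select(filter(toy, pernum == 1), inctot)

# Piped calls are read from top to bottom
piped <- toy %>%
  filter(pernum == 1) %>%
  select(inctot)

# Both approaches should give the same result
identical(nested, piped)
```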


See the power:


new_asec <- asec %>%
  filter(pernum == 1) %>%
  select(year, month, pernum, inctot) %>%
  mutate(inctot_adjusted = inctot * 1.1) %>%
  select(-inctot)

new_asec
# A tibble: 60,460 × 4
    year month pernum inctot_adjusted
   <dbl> <chr>  <dbl>           <dbl>
 1  2020 March      1          57750 
 2  2020 March      1          48400 
 3  2020 March      1          44002.
 4  2020 March      1              0 
 5  2020 March      1            671 
 6  2020 March      1          19279.
 7  2020 March      1          12349.
 8  2020 March      1          21589.
 9  2020 March      1          47306.
10  2020 March      1          10949.
# ℹ 60,450 more rows


2.11 6. summarize()

summarize() collapses many rows in a dataframe into fewer rows with summary statistics of the many rows. n(), mean(), and sum() are common summary statistics. Renaming is useful with summarize()!


# summarize without renaming the statistics
asec %>%
  summarize(mean(ftotval), mean(inctot))
# A tibble: 1 × 2
  `mean(ftotval)` `mean(inctot)`
            <dbl>          <dbl>
1         105254.     209921275.
# summarize and rename the statistics
asec %>%
  summarize(
    mean_ftotval = mean(ftotval), 
    mean_inctot = mean(inctot)
  )
# A tibble: 1 × 2
  mean_ftotval mean_inctot
         <dbl>       <dbl>
1      105254.  209921275.


summarize() returns a data frame. This means all dplyr functions can be used on the output of summarize(). This is powerful! Manipulating summary statistics in Stata and SAS can be a chore. Here, it’s just another dataframe that can be manipulated with a tool set optimized for dataframes: dplyr.
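
As a small sketch of this idea, the output of summarize() can be piped straight into mutate(). The tibble toy and the variable ratio are hypothetical, and the values are made up.

```r
library(dplyr)

# Hypothetical data with made-up values
toy <- tibble(
  ftotval = c(100, 200, 300),
  inctot = c(50, 100, 150)
)

stats <- toy %>%
  summarize(
    mean_ftotval = mean(ftotval),
    mean_inctot = mean(inctot)
  ) %>%
  # the summary is a one-row data frame, so mutate() works on it
  mutate(ratio = mean_inctot / mean_ftotval)

stats
```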

2.12 7. group_by()

group_by() groups a dataframe based on specified variables. summarize() with grouped dataframes creates subgroup summary statistics. mutate() with group_by() calculates grouped summaries for each row.


asec %>%
  group_by(pernum) %>%
  summarize(
    n = n(),
    mean_ftotval = mean(ftotval), 
    mean_inctot = mean(inctot)
  )
# A tibble: 16 × 4
   pernum     n mean_ftotval mean_inctot
    <dbl> <int>        <dbl>       <dbl>
 1      1 60460       94094.      57508.
 2      2 45151      108700.   77497357.
 3      3 25650      117966.  473030618.
 4      4 15797      121815.  634999933.
 5      5  6752      108609.  691504650.
 6      6  2582       89448.  682810446.
 7      7   922       78889.  682218196.
 8      8   353       72284.  682725646.
 9      9   158       54599.  632917559.
10     10    73       58145.  657543632.
11     11    37       61847.  702708584 
12     12    18       50249.  777780725.
13     13     3       25152   666666666 
14     14     1       18000       18000 
15     15     1       25000       25000 
16     16     1       15000       15000 


Dataframes can be grouped by multiple variables.

Grouped tibbles include metadata about groups. For example, Groups: pernum, offpov [40]. One level of grouping is dropped each time summarize() is used. It is easy to forget that a dataframe is grouped, so it is safest to include ungroup() at the end of a chain of functions.


asec %>%
  group_by(pernum, offpov) %>%
  summarize(
    n = n(),
    mean_ftotval = mean(ftotval), 
    mean_inctot = mean(inctot)
  ) %>%
  arrange(offpov) %>%
  ungroup()
`summarise()` has grouped output by 'pernum'. You can override using the
`.groups` argument.
# A tibble: 40 × 5
   pernum offpov                 n mean_ftotval mean_inctot
    <dbl> <chr>              <int>        <dbl>       <dbl>
 1      1 Above Poverty Line 53872      104451.      63642.
 2      2 Above Poverty Line 40978      118691.   59082162.
 3      3 Above Poverty Line 23052      129891.  463440562.
 4      4 Above Poverty Line 14076      135039.  631720097.
 5      5 Above Poverty Line  5805      123937.  688206447.
 6      6 Above Poverty Line  2118      105867.  683199297.
 7      7 Above Poverty Line   724       96817.  697520661.
 8      8 Above Poverty Line   269       90328.  672870019.
 9      9 Above Poverty Line   114       70438.  622815186.
10     10 Above Poverty Line    57       71483.  666678408.
# ℹ 30 more rows
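
The message about grouped output points to the .groups argument of summarize(). One option, sketched here on a hypothetical tibble with made-up values, is .groups = "drop", which returns an ungrouped result without a separate ungroup() call.

```r
library(dplyr)

# Hypothetical data with made-up values
toy <- tibble(
  pernum = c(1, 1, 2, 2),
  offpov = c("Above", "Below", "Above", "Above")
)

counts <- toy %>%
  group_by(pernum, offpov) %>%
  # .groups = "drop" removes all grouping from the output
  summarize(n = n(), .groups = "drop")

counts
```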


2.12.1 Exercise 5

  1. filter() to only include observations with "In Poverty Universe" in offpovuniv.
  2. group_by() offpov.
  3. Use summarize() and n() to count the number of observations in poverty.
asec %>%
  filter(offpovuniv == "In Poverty Universe") %>%
  group_by(offpov) %>%
  summarize(n())

2.12.2 Exercise 6

  1. filter() to only include observations with "In Poverty Universe".
  2. group_by() cpsid.
  3. Use mutate(family_size = n()) to calculate the family size for each observation in asec.
  4. ungroup()
  5. group_by() family_size, and offpov.
  6. Use summarize() and n() to see how many families of each size are experiencing poverty.
asec %>%
  filter(offpovuniv == "In Poverty Universe") %>%
  group_by(cpsid) %>%
  mutate(family_size = n()) %>%
  ungroup() %>%
  group_by(family_size, offpov) %>%
  summarize(n())

Are the estimates from the previous two exercises correct?

Let’s look at a Census Report to see how many people were in poverty in 2019. We estimated about 16,500 people. The Census Bureau says 34.0 million people.

No! We did not account for sampling weights, so our estimates are incorrect. library(srvyr) has tools for weighted estimation with complex surveys.
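
As a rough sketch of why weights matter, a weighted count sums the weights instead of counting rows. This assumes asecwt is the person-level weight. The tibble toy and its values are made up. The sketch gives point estimates only and no standard errors, which is where survey tools such as library(srvyr) come in.

```r
library(dplyr)

# Hypothetical data: each row is one sampled person with a made-up weight
toy <- tibble(
  offpov = c("Below Poverty Line", "Above Poverty Line", "Below Poverty Line"),
  asecwt = c(1500, 1000, 1200)
)

estimates <- toy %>%
  group_by(offpov) %>%
  summarize(
    n = n(),                  # unweighted count of sampled people
    weighted_n = sum(asecwt)  # estimated number of people represented
  )

estimates
```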

2.13 Resources