Please use all code samples responsibly - these are samples and likely require adjustments to work correctly for your specific needs. Read through the documentation and comments to understand any caveats or limitations of the code and/or data and follow-up with the code author or Code Library admins (code_library@urban.org) if you have questions on how to adapt the sample to your specific use case.
Purpose: This code looks across multiple columns and creates new columns with a flag based on those values. In this example, we have columns for the time it took between different program benchmarks (client assessment to program start, program start to lease start, and lease start to lease end) and we want to bin those values into buckets (i.e., one week, two weeks, a month, etc.). The benefit of this code is that it creates new columns with those bins and also keeps the original ones because we need both the raw data and binned data.
Data: The sample data is for time between program benchmarks, but can be applied to many common data cleaning tasks where you want to make new columns with buckets or flags based on multiple original columns.
Author: Amy Rogin (January 2024)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)time_to_service <-read_csv("data/time_to_service.csv") %>%# reformat columns to be date types (myd specifies that the original format is# month day year)mutate(across(.col =everything(), ~mdy(.)),# Date of assessment to referralassess_to_start =as.numeric(project_start_date - date_of_assessment),# Date of referral to date of project start (days)referral_to_start =as.numeric(project_start_date - date_referred),# Date of project start to lease start date (days)start_to_lease =as.numeric(lease_start_date - project_start_date), # MUTATE ACROSS MUTIPLE COLUMNS - CREATES BUCKETS FOR THE TIMELINES# The first line specifies that we want to mutate across the "assess_to_start", "referral_to_start", and "start_to_lease" columnsacross(c(assess_to_start, referral_to_start, start_to_lease),# the `list` creates a new column with the suffix "_bin"list(bin =~case_when(# the `case_when` is essentially multiple if else statements# if the column value is less than 7->make the bin column value "Less than 1 week"" . <7~"Less than 1 week", . >=7& . <=14~"1-2 weeks", . >14& . <=30~"2 weeks to 1 month", . >30& . <=60~"1-2 months", . >60~"More than 2 months")), # the .names specifies that the new columns should take the original {col} name and add "_bin" as the suffix# in this example it would create the col name "assess_to_start_bin" .names ="{col}_{fn}"))
Rows: 925 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): date_of_assessment, date_referred, project_start_date, lease_start_...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Show outputtime_to_service %>%select(assess_to_start, referral_to_start,assess_to_start_bin, referral_to_start_bin) %>%head()
# A tibble: 6 × 4
assess_to_start referral_to_start assess_to_start_bin referral_to_start_bin
<dbl> <dbl> <chr> <chr>
1 -12 210 Less than 1 week More than 2 months
2 0 -3 Less than 1 week Less than 1 week
3 26 23 2 weeks to 1 month 2 weeks to 1 month
4 0 -3 Less than 1 week Less than 1 week
5 0 0 Less than 1 week Less than 1 week
6 128 127 More than 2 months More than 2 months