3  Data Munging 2

Author

R Users Group

Published

September 25, 2024

3.1 Review

  • %>%: The pipe operator
  • dplyr verbs for data manipulation
    • select()
    • filter()
    • rename()
    • arrange()
    • mutate()
    • summarize()
    • group_by()

3.2 Motivation

  • Review and master library(dplyr)
  • Expand our skill set so we can do more types of analysis

3.2.1 Exercise 0

Step 1: Open up your .Rproj for urbn101. You should now see urbn101 in the top right of RStudio instead of “Project: (None)” .

Step 2: Install the R package nycflights13.

Step 3: Load tidyverse and nycflights13 at the top of your script.

Step 4: Run data(flights) in the console to load the flights data set into your R environment. Also run data(planes) to load the planes data set.

Step 5: Look at the flights data set (how many ways can you do this?)

3.3 Conditional transformation

Last week, we showed how mutate() can be used to create new variables or to transform new variables. We also learned about if_else()m which works great for one condition.

But what if you have multiple conditions? case_when() allows for a sequence of conditional logic that can be used to transform or create variables with mutate(). For example, here I create a variable called lateness that creates a character variable with the levels "very late", "late", and "on time". The logic is evaluated from top to bottom and TRUE is used to refer to all other remaining cases.

The syntax may look a little weird at first but it’s easy to pick up! The logical condition goes to the left of ~ and the output results goes to the right. & and | can be used to combine logical statements.

flights %>%
  mutate(
    lateness = case_when(
      arr_delay > 30 ~ "very late",
      arr_delay > 0 ~ "late",
      TRUE          ~ "on time"
    )
  )

3.3.1 Exercise 1

  1. Look at the planes data. We want to create a categorical variables called plane_type using the following rules for the variable seats:
  • “personal” if the plane has 4 or fewer seats
  • “commercial” if the plane has 5 to 199 seats
  • “jumbo if the plane has more than 200 seats
  1. Use mutate() and case_when() to create seats.
  2. Pipe into count(plane_tyepe).
library(tidyverse)
library(nycflights13)

planes %>%
  mutate(
    plane_type = case_when(
      seats <= 4   ~ "personal",
      seats <= 199 ~ "commercial",
      TRUE         ~ "jumbo"
    )
  ) %>%
  count(plane_type)
# A tibble: 3 × 2
  plane_type     n
  <chr>      <int>
1 commercial  2750
2 jumbo        551
3 personal      21

3.4 Strings

library(stringr) contains powerful and concise functions manipulating character variables. Reference the cheat sheet for an overview of the different string maniuplation functions.

3.4.1 Exercise 2

The month and day columns in flights are currently integer variables. We want to turn them into character variables with leading zeros. For example, 1 should be "01".

Step 1: Use mutate() to overwrite month and day with str_pad(). The first argument should be month or day. The second argument, width, should be 2.

Step 2: The padding character is currently a space, but we want it to be "0". Use ?str_pad to figure out how to switch the padding character.

Step 3: Pipe the result into the following line mutate(flight_date = paste(year, month, day, sep = "-"))

Step 4: Drop all variables except flight_date, distance, and air_time.

Step 5: Assign the result to flights_subset.

flights_subset <- flights %>%
  mutate(
    month = str_pad(month, width = 2, side = "left", pad = "0"),
    day = str_pad(day, width = 2, side = "left", pad = "0")
  ) %>%
  mutate(flight_date = paste(year, month, day, sep = "-")) %>%
  select(flight_date, distance, air_time)

3.5 Dates

library(lubridate) contains powerful and concise functions for creating and manipulating dates, times, and date-times. It is aware of leap days and leap seconds and is useful for calculating periods, durations, intervals, and more.

3.5.1 Exercise 3

Step 1: Add library(lubridate) after library(nycflight13) in your script.

Step 2: library(lubridate) is powerful but it needs variables in the correct format. Use ymd() inside of mutate() to turn the flight_date variable into a date rather than a character vector.

Step 3: Inside the previous mutate statement, add another column called weekday for the weekday of the flight. You can use wday(flight_date) to find the day of the week for each date.

Step 4: Assign the result to flights_subset.

Step 5: Use count() to count the number of flights by day of the week.

library(lubridate)

flights_subset <- flights_subset %>%
  mutate(flight_date = ymd(flight_date)) %>%
  mutate(weekday = wday(flight_date, label = TRUE)) 

flights_subset %>%
  count(weekday)
# A tibble: 7 × 2
  weekday     n
  <ord>   <int>
1 Sun     46357
2 Mon     50690
3 Tue     50422
4 Wed     50060
5 Thu     50219
6 Fri     50308
7 Sat     38720

3.6 group_by and summarize()

3.6.1 Exercise 4

We are going to summarize flights_subset from the previous example by weekday.

Step 1: group_by() weekday and use n() in summarize() to count the number of observations. This should match Step 5 from the previous exercise.

Step 2: In the same summarize(), calculate mean(), and max() distance.

Step 3: In the same summarize(), calculate median air_time.

Step 4: Rename the resulting variables inside summarize() so they have more useful names.

flights_subset %>%
  group_by(weekday) %>%
  summarize(
    n(),
    mean_distance = mean(distance),
    max_distance = max(distance),
    median_air_time = median(air_time, na.rm = TRUE)
  )
# A tibble: 7 × 5
  weekday `n()` mean_distance max_distance median_air_time
  <ord>   <int>         <dbl>        <dbl>           <dbl>
1 Sun     46357         1055.         4983             130
2 Mon     50690         1032.         4983             129
3 Tue     50422         1027.         4983             128
4 Wed     50060         1028.         4983             128
5 Thu     50219         1033.         4983             128
6 Fri     50308         1033.         4983             127
7 Sat     38720         1081.         4983             134

3.7 left_join()

Joins are the main method for combining two datasets with a commmon key column together. There are many types of joins, and we highly recommend you read this chapter on joins in R4DS if you want more info. Below is a quick visual summary of the types of joins you can perform in R. join_types

For now we will focus on the “left” join, which merges observations from the right data set to the left data set. This is the join type I use 90% of the time in R. Below is an example of how the left_join() function works.

people <- tribble(
  ~name, ~team, 
  "Aaron", "Pacers",
  "Kyle", "Wizards",
  "Ajjit", "Warriors",
  "Fay", "Wizards"
)

team_locations <- tribble(
  ~team, ~city,
  "Warriors", "Oakland",
  "Pacers", "Indianoplis",
  "Wizards", "Washington DC"
)

left_join(
  x = people,
  y = team_locations, 
  by = "team"
)
# A tibble: 4 × 3
  name  team     city         
  <chr> <chr>    <chr>        
1 Aaron Pacers   Indianoplis  
2 Kyle  Wizards  Washington DC
3 Ajjit Warriors Oakland      
4 Fay   Wizards  Washington DC

3.7.1 Exercise 5

flights contains information about flights and the unit of observation is airplane flights. planes contains information about the airplanes and the unit of observation is the airplane. The common column between these tables is the tailnum column.

We want to add information about planes to the flights data set. This is a left join because for every flight in the dataset, we want to append flight information. This is also called a many-to-one join because we are joining many rows from the flights data to one row from the planes data.

Step 1: Use left_join() to join planes to flights. The common key is tailnum.

Step 2: Use anti_join() to see observations from flights that don’t have a match in planes and call the output object unmatched_flights. The common key is tailnum.

Step 2: Use slice() and pull() to extract the tailnum value in the first row of unmatched_flights. Call this variable first_unmatched_tailnum

Step 3: Use filter() to see if first_unmatched_tailnum is in planes. Hint: it shouldn’t be!

left_join(
  x = flights,
  y = planes,
  by = "tailnum"
)
# A tibble: 336,776 × 27
   year.x month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1   2013     1     1      517            515         2      830            819
 2   2013     1     1      533            529         4      850            830
 3   2013     1     1      542            540         2      923            850
 4   2013     1     1      544            545        -1     1004           1022
 5   2013     1     1      554            600        -6      812            837
 6   2013     1     1      554            558        -4      740            728
 7   2013     1     1      555            600        -5      913            854
 8   2013     1     1      557            600        -3      709            723
 9   2013     1     1      557            600        -3      838            846
10   2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 19 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, year.y <int>, type <chr>,
#   manufacturer <chr>, model <chr>, engines <int>, seats <int>, speed <int>,
#   engine <chr>
first_unmatched_tailnum <- anti_join(
  x = flights,
  y = planes,
  by = "tailnum"
) %>%
  slice(1) %>%
  pull(tailnum)

planes %>%
  filter(tailnum == first_unmatched_tailnum)
# A tibble: 0 × 9
# ℹ 9 variables: tailnum <chr>, year <int>, type <chr>, manufacturer <chr>,
#   model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>

3.8 Custom functions

Sometimes functions don’t exist for desired calculations or we want to combine many calculations into one function to reduce copying-and-pasting.

“You should consider writing a function whenever you’ve copied and pasted a block of code more than twice (i.e. you now have three copies of the same code).” ~ R4DS

R has a flexible function system that makes it very easy to define custom functions!

function_name <- function(arg1, arg2 = default) {

  # function body

}

Three ingredients

  • Function name - usually verbs
  • Function arguments - inputs to the function (optional)
  • Function body
square <- function(x = 2) {
  x ^ 2
}

square()
[1] 4
square(x = 4)
[1] 16

Note: Using tidyverse functions inside of custom functions often requires non-standard evaluation. Please reach out for help when this is your goal.

3.8.1 Exercise 6

Step 1: Write a function called multiply_xy() that takes arguments x and y and multiplies them together.

Step 2: Add your favorite number as the default for x and your least favorite number as the default for y.

Step 3: Call the function and overwrite the default for y with your favorite number.

multiply_xy <- function(x = 3.14, y = 7) {
  
  x * y
  
}

multiply_xy(y = 3.14) 
[1] 9.8596

3.9 Resources