%>%
flights mutate(
lateness = case_when(
> 30 ~ "very late",
arr_delay > 0 ~ "late",
arr_delay TRUE ~ "on time"
) )
3 Data Munging 2
3.1 Review
%>%
: The pipe operatordplyr
verbs for data manipulationselect()
filter()
rename()
arrange()
mutate()
summarize()
group_by()
3.2 Motivation
- Review and master
library(dplyr)
- Expand our skill set so we can do more types of analysis
3.2.1 Exercise 0
Step 1: Open up your .Rproj
for urbn101
. You should now see urbn101
in the top right of RStudio instead of “Project: (None)” .
Step 2: Install the R package nycflights13
.
Step 3: Load tidyverse
and nycflights13
at the top of your script.
Step 4: Run data(flights)
in the console to load the flights data set into your R environment. Also run data(planes)
to load the planes data set.
Step 5: Look at the flights
data set (how many ways can you do this?)
3.3 Conditional transformation
Last week, we showed how mutate()
can be used to create new variables or to transform new variables. We also learned about if_else()
m which works great for one condition.
But what if you have multiple conditions? case_when()
allows for a sequence of conditional logic that can be used to transform or create variables with mutate()
. For example, here I create a variable called lateness
that creates a character variable with the levels "very late"
, "late"
, and "on time"
. The logic is evaluated from top to bottom and TRUE
is used to refer to all other remaining cases.
The syntax may look a little weird at first but it’s easy to pick up! The logical condition goes to the left of ~
and the output results goes to the right. &
and |
can be used to combine logical statements.
3.3.1 Exercise 1
- Look at the
planes
data. We want to create a categorical variables calledplane_type
using the following rules for the variableseats
:
- “personal” if the plane has 4 or fewer seats
- “commercial” if the plane has 5 to 199 seats
- “jumbo if the plane has more than 200 seats
- Use
mutate()
andcase_when()
to createseats
. - Pipe into
count(plane_tyepe)
.
library(tidyverse)
library(nycflights13)
%>%
planes mutate(
plane_type = case_when(
<= 4 ~ "personal",
seats <= 199 ~ "commercial",
seats TRUE ~ "jumbo"
)%>%
) count(plane_type)
# A tibble: 3 × 2
plane_type n
<chr> <int>
1 commercial 2750
2 jumbo 551
3 personal 21
3.4 Strings
library(stringr)
contains powerful and concise functions manipulating character variables. Reference the cheat sheet for an overview of the different string maniuplation functions.
3.4.1 Exercise 2
The month
and day
columns in flights
are currently integer variables. We want to turn them into character variables with leading zeros. For example, 1
should be "01"
.
Step 1: Use mutate()
to overwrite month
and day
with str_pad()
. The first argument should be month
or day
. The second argument, width
, should be 2
.
Step 2: The padding character is currently a space, but we want it to be "0"
. Use ?str_pad
to figure out how to switch the padding character.
Step 3: Pipe the result into the following line mutate(flight_date = paste(year, month, day, sep = "-"))
Step 4: Drop all variables except flight_date
, distance
, and air_time
.
Step 5: Assign the result to flights_subset
.
<- flights %>%
flights_subset mutate(
month = str_pad(month, width = 2, side = "left", pad = "0"),
day = str_pad(day, width = 2, side = "left", pad = "0")
%>%
) mutate(flight_date = paste(year, month, day, sep = "-")) %>%
select(flight_date, distance, air_time)
3.5 Dates
library(lubridate)
contains powerful and concise functions for creating and manipulating dates, times, and date-times. It is aware of leap days and leap seconds and is useful for calculating periods, durations, intervals, and more.
3.5.1 Exercise 3
Step 1: Add library(lubridate)
after library(nycflight13)
in your script.
Step 2: library(lubridate)
is powerful but it needs variables in the correct format. Use ymd()
inside of mutate()
to turn the flight_date
variable into a date rather than a character vector.
Step 3: Inside the previous mutate
statement, add another column called weekday
for the weekday of the flight. You can use wday(flight_date)
to find the day of the week for each date.
Step 4: Assign the result to flights_subset
.
Step 5: Use count()
to count the number of flights by day of the week.
library(lubridate)
<- flights_subset %>%
flights_subset mutate(flight_date = ymd(flight_date)) %>%
mutate(weekday = wday(flight_date, label = TRUE))
%>%
flights_subset count(weekday)
# A tibble: 7 × 2
weekday n
<ord> <int>
1 Sun 46357
2 Mon 50690
3 Tue 50422
4 Wed 50060
5 Thu 50219
6 Fri 50308
7 Sat 38720
3.6 group_by
and summarize()
3.6.1 Exercise 4
We are going to summarize flights_subset
from the previous example by weekday
.
Step 1: group_by()
weekday
and use n()
in summarize()
to count the number of observations. This should match Step 5 from the previous exercise.
Step 2: In the same summarize()
, calculate mean()
, and max()
distance
.
Step 3: In the same summarize()
, calculate median air_time
.
Step 4: Rename the resulting variables inside summarize()
so they have more useful names.
%>%
flights_subset group_by(weekday) %>%
summarize(
n(),
mean_distance = mean(distance),
max_distance = max(distance),
median_air_time = median(air_time, na.rm = TRUE)
)
# A tibble: 7 × 5
weekday `n()` mean_distance max_distance median_air_time
<ord> <int> <dbl> <dbl> <dbl>
1 Sun 46357 1055. 4983 130
2 Mon 50690 1032. 4983 129
3 Tue 50422 1027. 4983 128
4 Wed 50060 1028. 4983 128
5 Thu 50219 1033. 4983 128
6 Fri 50308 1033. 4983 127
7 Sat 38720 1081. 4983 134
3.7 left_join()
Joins are the main method for combining two datasets with a commmon key column together. There are many types of joins, and we highly recommend you read this chapter on joins in R4DS if you want more info. Below is a quick visual summary of the types of joins you can perform in R.
For now we will focus on the “left” join, which merges observations from the right data set to the left data set. This is the join type I use 90% of the time in R. Below is an example of how the left_join()
function works.
<- tribble(
people ~name, ~team,
"Aaron", "Pacers",
"Kyle", "Wizards",
"Ajjit", "Warriors",
"Fay", "Wizards"
)
<- tribble(
team_locations ~team, ~city,
"Warriors", "Oakland",
"Pacers", "Indianoplis",
"Wizards", "Washington DC"
)
left_join(
x = people,
y = team_locations,
by = "team"
)
# A tibble: 4 × 3
name team city
<chr> <chr> <chr>
1 Aaron Pacers Indianoplis
2 Kyle Wizards Washington DC
3 Ajjit Warriors Oakland
4 Fay Wizards Washington DC
3.7.1 Exercise 5
flights
contains information about flights and the unit of observation is airplane flights. planes
contains information about the airplanes and the unit of observation is the airplane. The common column between these tables is the tailnum
column.
We want to add information about planes to the flights data set. This is a left join because for every flight in the dataset, we want to append flight information. This is also called a many-to-one join because we are joining many rows from the flights
data to one row from the planes
data.
Step 1: Use left_join()
to join planes
to flights
. The common key is tailnum
.
Step 2: Use anti_join()
to see observations from flights
that don’t have a match in planes
and call the output object unmatched_flights
. The common key is tailnum
.
Step 2: Use slice()
and pull()
to extract the tailnum
value in the first row of unmatched_flights
. Call this variable first_unmatched_tailnum
Step 3: Use filter()
to see if first_unmatched_tailnum
is in planes
. Hint: it shouldn’t be!
left_join(
x = flights,
y = planes,
by = "tailnum"
)
# A tibble: 336,776 × 27
year.x month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 19 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>, year.y <int>, type <chr>,
# manufacturer <chr>, model <chr>, engines <int>, seats <int>, speed <int>,
# engine <chr>
<- anti_join(
first_unmatched_tailnum x = flights,
y = planes,
by = "tailnum"
%>%
) slice(1) %>%
pull(tailnum)
%>%
planes filter(tailnum == first_unmatched_tailnum)
# A tibble: 0 × 9
# ℹ 9 variables: tailnum <chr>, year <int>, type <chr>, manufacturer <chr>,
# model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>
3.8 Custom functions
Sometimes functions don’t exist for desired calculations or we want to combine many calculations into one function to reduce copying-and-pasting.
“You should consider writing a function whenever you’ve copied and pasted a block of code more than twice (i.e. you now have three copies of the same code).” ~ R4DS
R has a flexible function system that makes it very easy to define custom functions!
<- function(arg1, arg2 = default) {
function_name
# function body
}
Three ingredients
- Function name - usually verbs
- Function arguments - inputs to the function (optional)
- Function body
<- function(x = 2) {
square ^ 2
x
}
square()
[1] 4
square(x = 4)
[1] 16
Note: Using tidyverse functions inside of custom functions often requires non-standard evaluation. Please reach out for help when this is your goal.
3.8.1 Exercise 6
Step 1: Write a function called multiply_xy()
that takes arguments x
and y
and multiplies them together.
Step 2: Add your favorite number as the default for x
and your least favorite number as the default for y
.
Step 3: Call the function and overwrite the default for y
with your favorite number.
<- function(x = 3.14, y = 7) {
multiply_xy
* y
x
}
multiply_xy(y = 3.14)
[1] 9.8596
3.9 Resources
- R4DS: functions
stringr
cheat sheet