Chapter 15 - Factors
Load the libraries needed for these exercises.
library(tidyverse)
library(forcats)
15.4 - Modifying factor order
Problem 1
There are some conspicuously high numbers in tvhours. Is the mean a good summary?
Extreme values pull the mean upward. tvhours is skewed to the right, so the median may be a better summary statistic.
gss_cat %>%
  ggplot(aes(tvhours)) +
  geom_histogram()
mean(gss_cat$tvhours, na.rm = TRUE)
## [1] 2.980771
median(gss_cat$tvhours, na.rm = TRUE)
## [1] 2
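To see why a few large values move the mean but not the median, here is a minimal sketch on a made-up vector (the numbers are illustrative and not taken from gss_cat):

```r
# Hypothetical daily TV hours for seven people; the last value is an outlier.
hours <- c(1, 2, 2, 2, 3, 3, 24)

mean_hours <- mean(hours)      # pulled upward by the single value of 24
median_hours <- median(hours)  # middle value, unaffected by the outlier

# TRUE when the outlier has dragged the mean above the median.
mean_hours > median_hours
```

Dropping the value of 24 from the vector brings the mean much closer to the median, which is the pattern the histogram of tvhours suggests.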
Problem 2
For each factor in gss_cat identify whether the order of the levels is arbitrary or principled.
marital: arbitrary
age: principled
race: arbitrary
rincome: principled
partyid: arbitrary
relig: arbitrary
denom: arbitrary
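One way to make these calls is to print the levels of each factor and read them in order. A short sketch, assuming the gss_cat data set from forcats is available:

```r
library(forcats)  # provides the gss_cat data set

# Keep only the factor columns, then list the levels of each one.
factor_cols <- Filter(is.factor, gss_cat)
factor_levels <- lapply(factor_cols, levels)

# Inspect one factor at a time, for example rincome.
factor_levels$rincome
```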
Problem 3
Why did moving “Not applicable” to the front of the levels move it to the bottom of the plot?
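ggplot2 places the first factor level closest to the axis origin. When coord_flip is used, the discrete axis runs from bottom to top, so the levels appear in reverse order from top to bottom and the first level lands at the bottom of the plot.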
gss_cat %>%
  mutate(race = fct_relevel(race, "Not applicable")) %>%
  ggplot(aes(race)) +
  geom_bar() +
  coord_flip() +
  scale_x_discrete(drop = FALSE)
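The effect can also be checked without a plot by looking at the position of each level, since a discrete axis places level 1 nearest the origin. A small sketch on a made-up factor, using fct_relevel() and fct_rev() from forcats:

```r
library(forcats)

# Toy factor; levels are given explicitly so the starting order is known.
f <- factor(c("White", "Black", "Other", "Not applicable"),
            levels = c("Other", "Black", "White", "Not applicable"))

# Move "Not applicable" to the front: it becomes level 1.
f_front <- fct_relevel(f, "Not applicable")
levels(f_front)

# Reversing the levels puts the first level last, which would draw it
# at the top of a flipped plot instead of the bottom.
f_top <- fct_rev(f_front)
levels(f_top)
```

If the goal is to show a level at the top of a flipped bar chart, reversing the level order in this way is one option.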
15.5 - Modifying factor levels
Problem 1
How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?
gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat"))) %>%
  filter(partyid != "other") %>%
  # Share of each party within a year (shares sum to 1 per year).
  count(year, partyid) %>%
  group_by(year) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(aes(x = year, y = prop, fill = partyid)) +
  geom_col(position = "dodge") +
  scale_x_continuous(breaks = seq(2000, 2014, by = 2))
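To read the proportions as numbers rather than from the bars, the shares within each year can be tabulated. A minimal sketch with base R on a made-up data frame (the rows are illustrative, not GSS data):

```r
# Hypothetical responses: one row per respondent.
toy <- data.frame(
  year  = c(2000, 2000, 2000, 2000, 2002, 2002, 2002, 2002),
  party = c("dem", "dem", "rep", "ind", "dem", "rep", "rep", "ind")
)

# Cross-tabulate, then divide each row (year) by its total.
counts <- table(toy$year, toy$party)
props  <- prop.table(counts, margin = 1)

props
```

Each row of props sums to 1, so a change in a party's share from one year to the next can be read directly down its column.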
Problem 2
How could you collapse rincome into a small set of categories?
gss_cat %>%
  mutate(rincome = fct_collapse(rincome,
    `More than $10,000` = c("$10000 - 14999", "$15000 - 19999",
                            "$20000 - 24999", "$25000 or more"),
    `Less than $10,000` = c("Lt $1000", "$1000 to 2999", "$3000 to 3999",
                            "$4000 to 4999", "$5000 to 5999", "$6000 to 6999",
                            "$7000 to 7999", "$8000 to 9999"))) %>%
  mutate(rincome = fct_lump(rincome, n = 3)) %>%
  count(rincome, sort = TRUE)
## # A tibble: 4 x 2
##   rincome               n
##   <fct>             <int>
## 1 More than $10,000 10862
## 2 Not applicable     7043
## 3 Less than $10,000  2153
## 4 Other              1425
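The same collapsing pattern on a small made-up factor shows how fct_collapse() maps several old levels onto one new level. The new level names low and high are hypothetical and chosen only for illustration:

```r
library(forcats)

# Toy income brackets, borrowing a few level names from rincome.
income <- factor(c("Lt $1000", "$1000 to 2999", "$10000 - 14999",
                   "$25000 or more", "$25000 or more"))

# Each new level on the left absorbs the old levels listed on the right.
income_small <- fct_collapse(income,
  low  = c("Lt $1000", "$1000 to 2999"),
  high = c("$10000 - 14999", "$25000 or more")
)

table(income_small)
```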