Chapter 7 - Exploratory Data Analysis
Load the libraries needed for these exercises.
library(tidyverse)

7.3 - Variation
Problem 1
Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.
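The values of x, y, and z mostly fall between 0 and 10 mm, although y and z have much longer right tails caused by a few extreme outliers. Because z is typically smaller than x and y, it is most plausibly the depth, with x and y the length and width.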
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = x), binwidth = 0.5)
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5)
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = z), binwidth = 0.5)
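A numeric summary can complement the histograms when judging how far the tails extend. The following is a minimal sketch, assuming the tidyverse is loaded as above; the name dim_summary and the long-format reshaping are illustrative choices, not part of the original exercise.

```r
library(tidyverse)

# Summarise each dimension (in mm) to quantify the tails seen in the histograms
dim_summary <- diamonds %>%
  select(x, y, z) %>%
  pivot_longer(everything(), names_to = "dimension", values_to = "mm") %>%
  group_by(dimension) %>%
  summarise(
    min = min(mm),
    median = median(mm),
    max = max(mm),
    n_zero = sum(mm == 0)  # a dimension of 0 mm is physically impossible
  )

dim_summary
```

A large gap between the median and the maximum, or any zero values, points to likely data-entry errors rather than real diamonds.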
Problem 2
Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)
The distribution of price is strongly right-skewed: counts peak at the low end of the range, below about $2000, and fall off into a long tail of much more expensive diamonds. Narrowing binwidth shows that some price ranges contain very few diamonds.
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price), binwidth = 1000)
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price), binwidth = 500)
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price), binwidth = 100)
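To look at sparsely populated price ranges without relying on the plot alone, the bins can be tabulated directly. This is a rough sketch, assuming the tidyverse is loaded; the $1000 to $2000 window and the name price_bins are arbitrary choices for illustration.

```r
library(tidyverse)

# Count diamonds in $100-wide price bins between $1000 and $2000
price_bins <- diamonds %>%
  filter(price >= 1000, price < 2000) %>%
  count(bin = cut_width(price, width = 100, boundary = 1000))

price_bins
```

Bins with unusually low counts relative to their neighbours are worth a closer look with a narrower width.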
Problem 3
How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?
There are 23 diamonds of 0.99 carat but 1558 of exactly 1 carat. Buyers may prefer a full carat to a diamond that is just under one, and the size of the gap suggests considerable rounding up in the data set:
diamonds %>%
  filter(between(carat, 0.99, 1.00)) %>%
  group_by(carat) %>%
  count()
## # A tibble: 2 x 2
## # Groups:   carat [2]
##   carat     n
##   <dbl> <int>
## 1  0.99    23
## 2  1     1558
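Widening the window around 1 carat helps show whether the jump at 1.00 is isolated or part of a broader pattern. The sketch below assumes the tidyverse is loaded; the 0.9 to 1.1 range and the name carat_counts are illustrative.

```r
library(tidyverse)

# Count diamonds at each recorded carat value from 0.9 to 1.1
carat_counts <- diamonds %>%
  filter(carat >= 0.9, carat <= 1.1) %>%
  count(carat)

print(carat_counts, n = Inf)
```

If the counts just below 1.00 are low while those at and just above it are high, that is consistent with the rounding explanation.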
Problem 4
Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?
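Compare the following three graphs. coord_cartesian() only zooms the view: the bins are computed from all of the data and then cropped to the limits. xlim() and ylim() set the scale limits instead, so values outside them are replaced with NA and dropped. On the x axis this removes observations before binning, so with binwidth unset the default 30 bins are recomputed over the restricted range. On the y axis, as in the third graph, any bar taller than the limit disappears entirely rather than being cut off.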
ggplot(diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5)
ggplot(diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 60))
ggplot(diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  ylim(0, 60)
## Warning: Removed 11 rows containing missing values (geom_bar).
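The effect of leaving binwidth unset can be checked by zooming on the x axis both ways and inspecting the bins that ggplot2 actually computed. This is a sketch under stated assumptions: the 0 to 10 range and the names p_coord, p_xlim, bins_coord, and bins_xlim are illustrative, and layer_data() is used only to read back the computed bin edges.

```r
library(tidyverse)

# Zoom on the x axis two ways, leaving binwidth unset (default: 30 bins)
p_coord <- ggplot(diamonds, aes(x = y)) +
  geom_histogram() +
  coord_cartesian(xlim = c(0, 10))  # bins computed on all data, then zoomed

p_xlim <- ggplot(diamonds, aes(x = y)) +
  geom_histogram() +
  xlim(0, 10)                       # out-of-range values dropped, then binned

# Read back the bins computed for each plot
bins_coord <- layer_data(p_coord)
bins_xlim <- layer_data(p_xlim)

# Compare the bin widths used in each case
unique(round(bins_coord$xmax - bins_coord$xmin, 3))
unique(round(bins_xlim$xmax - bins_xlim$xmin, 3))
```

The coord_cartesian() version should show much wider bins, because its 30 bins span the full range of y, outliers included.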

7.4 - Missing Values
Problem 1
What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?
Missing values are plotted in a bar chart but not in a histogram: geom_histogram() removes them with a warning, while geom_bar() draws them as a separate NA bar. Remember that histograms are generally used to display numeric data, while bar charts are used for categorical data. Missing values can be treated as one more category in a bar chart, but there is no intuitive place to put them on the numeric axis of a histogram.
# ifelse() drops the factor class here, so cut becomes its integer codes (numeric)
diamonds %>%
  mutate(cut = ifelse(cut == 'Fair', NA, cut)) %>%
  ggplot(aes(x = cut)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1610 rows containing non-finite values (stat_bin).

diamonds %>%
  mutate(cut = as.character(cut)) %>%
  mutate(cut = ifelse(cut == 'Fair', NA, cut)) %>%
  ggplot(aes(x = cut)) +
  geom_bar()
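The counts behind the bar chart can also be tabulated, which makes the NA category explicit. This is a minimal sketch, assuming the tidyverse is loaded; it uses if_else() with NA_character_ in place of ifelse() as a type-stable alternative, and the name cut_counts is illustrative.

```r
library(tidyverse)

# Recode "Fair" as missing, then count each category including NA
cut_counts <- diamonds %>%
  mutate(cut = as.character(cut)) %>%
  mutate(cut = if_else(cut == "Fair", NA_character_, cut)) %>%
  count(cut)

cut_counts
```

The NA row here corresponds to the extra bar that geom_bar() draws and to the rows that geom_histogram() removed.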
Problem 2
What does na.rm = TRUE do in mean() and sum()?
Setting na.rm = TRUE removes missing values before the result is computed. Without it, a single NA makes the mean or sum NA.
x <- c(1, 2, 3, NA)
mean(x)
## [1] NA
mean(x, na.rm = TRUE)
## [1] 2
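The same behaviour applies to sum(). The short sketch below uses only base R; the variable names are illustrative.

```r
x <- c(1, 2, 3, NA)

total_with_na <- sum(x)              # NA: the missing value propagates
total_no_na <- sum(x, na.rm = TRUE)  # 6: NA dropped, leaving 1 + 2 + 3
mean_no_na <- mean(x, na.rm = TRUE)  # 2: divides by 3 values, not 4

n_missing <- sum(is.na(x))           # 1: how many values were dropped
```

Counting the missing values alongside the summary is a cheap way to keep track of how much data na.rm = TRUE silently discarded.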