Chapter 7 - Exploratory Data Analysis
Load the libraries needed for these exercises.
library(tidyverse)
7.3 - Variation
Problem 1
Explore the distribution of each of the x
, y
, and z
variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.
The distribution of x
, y
, and z
generally seems to fall between 0 and 10mm, although the distributions of y
and z
both have much longer tails.
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = x), binwidth = 0.5)
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5)
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = z), binwidth = 0.5)
Problem 2
Explore the distribution of price
. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth
and make sure you try a wide range of values.)
The price
of diamonds appears to peak around $2000, followed by a long tail for the much more expensive diamonds. Narrowing the value of binwidth
shows that some values are not very populated.
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = price), binwidth = 1000)
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = price), binwidth = 500)
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = price), binwidth = 100)
Problem 3
How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?
People may prefer to buy a diamond that is a full carat rather than almost a carat large. There appears to be significant rounding in the data set:
diamonds %>%
filter(between(carat, 0.99, 1.00)) %>%
group_by(carat) %>%
count()
## # A tibble: 2 x 2
## # Groups: carat [2]
## carat n
## <dbl> <int>
## 1 0.99 23
## 2 1 1558
Problem 4
Compare and contrast coord_cartesian()
vs xlim()
or ylim()
when zooming in on a histogram. What happens if you leave binwidth
unset? What happens if you try and zoom so only half a bar shows?
Compare and contrast the following three graphs: while coord_cartesian()
will preserve data, ylim()
will drop rows that fall outside of the limits.
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5)
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
coord_cartesian(ylim = c(0,60))
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
ylim(0,60)
## Warning: Removed 11 rows containing missing values (geom_bar).
7.4 - Missing Values
Problem 1
What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?
Missing values are plotted in a bar chart but not a histogram. Remember that histograms are generally used to display numeric data, while bar charts are used for categorical data. Missing values can be considered another category to plot in a bar chart, but there is not necessarily an intuitive way to place missing values in a histogram.
diamonds %>%
mutate(cut = ifelse(cut == 'Fair', NA, cut)) %>%
ggplot(aes(x=cut)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1610 rows containing non-finite values (stat_bin).
diamonds %>%
mutate(cut = as.character(cut)) %>%
mutate(cut = ifelse(cut == 'Fair', NA, cut)) %>%
ggplot(aes(x=cut)) +
geom_bar()
Problem 2
What does na.rm = TRUE
do in mean()
and sum()
?
Setting na.rm = TRUE
will remove missing values before executing the function.
x <- c(1, 2, 3, NA)
mean(x)
## [1] NA
mean(x, na.rm = TRUE)
## [1] 2