palmerpenguins
DataThe palmerpenguins
package contains data about three
species of penguins collected from three islands in the Palmer
Archipelago, Antarctica. We will use an adapted version of the dataset
to demonstrate some of the concepts discussed above.
# create dataset we will be using
penguins <- penguins %>%
filter(species == "Adelie") %>%
select(
sex,
bill_length_mm,
flipper_length_mm
) %>%
drop_na()
penguins %>%
head() %>%
create_table()
sex | bill_length_mm | flipper_length_mm |
---|---|---|
male | 39.1 | 181 |
female | 39.5 | 186 |
female | 40.3 | 195 |
female | 36.7 | 193 |
male | 39.3 | 190 |
female | 38.9 | 181 |
The above code simplifies the dataset to only three variables and
removes missing values in those variables. We will synthesize the
sex
, bill_length_mm
, and
flipper_length_mm
variables in this dataset using some of
the methods discussed above. Since we are synthesizing all three
variables, our final version of the dataset is considered fully
synthetic.
sex
variableLet’s start by synthesizing sex
, which is a binary
variable that can take a value of either “male” or “female”. To
synthesize this variable, we will identify the underlying percentages of
the data that fall into each category and use it to generate records
that mimic the properties of the confidential data.
# identify percentage of total that each category (sex) makes up
penguins %>%
count(sex) %>%
mutate(relative_frequency = n / sum(n)) %>%
create_table()
sex | n | relative_frequency |
---|---|---|
female | 73 | 0.5 |
male | 73 | 0.5 |
Using these proportions, we will now randomly sample with replacement to mimic the underlying distribution of gender.
# set a seed so pseudo-random processes are reproducible
set.seed(20220301)
# vector of gender categories
sex_categories <- c('female', 'male')
# size of sample to generate
synthetic_data_size <- nrow(penguins)
# probability weights
sex_probs <- penguins %>%
count(sex) %>%
mutate(relative_frequency = n / sum(n)) %>%
pull(relative_frequency)
# use sample function to generate synthetic vector of genders
sex_synthetic <- sample(
x = sex_categories,
size = synthetic_data_size,
replace = TRUE,
prob = sex_probs
)
Our new sex_synthetic
variable will form the foundation
of our synthesized data.
# use vector to generate synthetic gender column
penguins_synthetic <- tibble(
sex = sex_synthetic
)
penguins_synthetic %>%
head() %>%
create_table()
sex |
---|
female |
male |
female |
male |
female |
male |
bill_length_mm
variableUnlike sex
, bill_length_mm
is numeric.
summary(penguins$bill_length_mm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 32.10 36.73 38.85 38.82 40.77 46.00
To synthesize this variable, we are going to predict the
bill_length_mm
for each penguin using a linear regression
with sex
as a predictor.
Note that sex
is a factor variable with possible values
of “male” and “female” which can’t directly be used in a regression. So
under the hood, R converts that factor variable into a binary numeric
variables (i.e. 0
or 1
), and then runs the
regression.
# linear regression
bill_length_lm <- lm(
formula = bill_length_mm ~ sex,
data = penguins
)
summary(bill_length_lm)
##
## Call:
## lm(formula = bill_length_mm ~ sex, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.790 -1.357 0.076 1.393 5.610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2575 0.2524 147.608 < 2e-16 ***
## sexmale 3.1329 0.3570 8.777 4.44e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.157 on 144 degrees of freedom
## Multiple R-squared: 0.3485, Adjusted R-squared: 0.344
## F-statistic: 77.03 on 1 and 144 DF, p-value: 4.44e-15
Now that we have coefficients for the linear regression, we can
generate our synthetic values. First, we try a straightforward
prediction of bill lengths using the synthetic sex
variable.
# predict bill length with model coefficients
bill_length_synthetic_method1 <- predict(
object = bill_length_lm,
newdata = penguins_synthetic
)
# add predictions to synthetic data as bill_length
penguins_synthetic_method1 <- penguins_synthetic %>%
mutate(bill_length_mm = bill_length_synthetic_method1)
And now we compare the univariate distributions of the confidential data to our newly synthesized variable with a graph.
# create dataframe with both confidential and synthesized values
compare_penguins <- bind_rows(
"confidential" = penguins,
"synthetic" = penguins_synthetic_method1,
.id = "data_source"
)
# plot comparison of bill_length_mm distributions
compare_penguins %>%
select(data_source, bill_length_mm) %>%
pivot_longer(-data_source, names_to = "variable") %>%
ggplot(aes(x = value, fill = data_source)) +
geom_density(alpha = 0.3) +
labs(title = "Comparison of Univariate Distributions",
subtitle = "Method 1") +
scatter_grid()
Simply using the predicted values from our linear regression does not give us enough variation in the synthetic variable.
To understand more about the predictions made from the linear
regression model, let’s dig into the predicted values for the first five
rows of data and the corresponding synthetic sex
.
# Look at first few rows of synthetic data
penguins_synthetic_method1 %>%
head() %>%
create_table()
sex | bill_length_mm |
---|---|
female | 37.25753 |
male | 40.39041 |
female | 37.25753 |
male | 40.39041 |
female | 37.25753 |
male | 40.39041 |
We know from our regression analysis output from above that the
intercept (\(\beta_0\)) is 42.1 and the
coefficient for a male penguin (\(\beta_1\)) is 3.8. Therefore, if the
penguin is male, we have a predicted value (\(\hat{y}\)) of (42.1 + 3.8 = )
45.9. If the penguin is female, our predicted value (\(\hat{y}\)) is only the intercept, 42.1.
Because sex
will only take two values, our synthetic
bill_length_mm
also only takes two values. The model fit
limits the variation that is possible, making our synthetic variable
significantly less useful.
Instead, we can try a second method. We will create a version of the variable, where the synthetic value is a draw from a normal distribution with a mean of the regression line for the given predictions, and standard deviation equal to the residual standard error.
# set a seed so pseudo-random processes are reproducible
set.seed(20220301)
# predict bill length with model coefficients
bill_length_predicted <- predict(
object = bill_length_lm,
newdata = penguins_synthetic
)
# synthetic column using normal distribution centered on predictions with sd of residual standard error
bill_length_synthetic_method2 <- rnorm(
n = nrow(penguins_synthetic),
mean = bill_length_predicted,
sd = sigma(bill_length_lm)
)
# add predictions to synthetic data as bill_length
penguins_synthetic_method2 <- penguins_synthetic %>%
mutate(bill_length_mm = bill_length_synthetic_method2)
Now, we again compare the univariate distributions of the confidential data and the synthetic data we generated with this second method.
# create dataframe with both confidential and synthesized values
compare_penguins <- bind_rows(
"confidential" = penguins,
"synthetic" = penguins_synthetic_method2,
.id = "data_source"
)
# plot comparison of bill_length_mm distributions
compare_penguins %>%
select(data_source, bill_length_mm) %>%
pivot_longer(-data_source, names_to = "variable") %>%
ggplot(aes(x = value, fill = data_source)) +
geom_density(alpha = 0.3) +
labs(title = "Comparison of Univariate Distributions",
subtitle = "Method 2") +
scatter_grid()
We have much more variation with this new method, though the
distributions still do not match perfectly. We choose this method’s
output to add as the synthetic bill_length_mm
in our final
synthesized dataset. And now our synthesized dataset has two
columns.
# using method 2 as synthesized variable
penguins_synthetic <- penguins_synthetic %>%
mutate(bill_length_mm = bill_length_synthetic_method2)
penguins_synthetic %>%
head() %>%
create_table()
sex | bill_length_mm |
---|---|
female | 38.76954 |
male | 42.85524 |
female | 38.52128 |
male | 38.15742 |
female | 35.04525 |
male | 39.22174 |
flipper_length_mm
The flipper_length_mm
variable is also numeric, so we
can follow the same steps we used to synthesize
bill_length_mm
.
summary(penguins$flipper_length_mm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 172.0 186.0 190.0 190.1 195.0 210.0
This time, we regress flipper_length_mm
on both
sex
and bill_length_mm
.
# linear regression
flipper_length_lm <- lm(
formula = flipper_length_mm ~ sex + bill_length_mm,
data = penguins
)
summary(flipper_length_lm)
##
## Call:
## lm(formula = flipper_length_mm ~ sex + bill_length_mm, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.0907 -2.8058 -0.0534 3.3284 15.8789
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 170.6183 8.7497 19.500 <2e-16 ***
## sexmale 3.1721 1.2422 2.554 0.0117 *
## bill_length_mm 0.4610 0.2341 1.970 0.0508 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.058 on 143 degrees of freedom
## Multiple R-squared: 0.1492, Adjusted R-squared: 0.1373
## F-statistic: 12.54 on 2 and 143 DF, p-value: 9.606e-06
Since we already know we prefer the method using draws from the normal distribution centered on the mean of the regression line, we will default to that.
# set a seed so pseudo-random processes are reproducible
set.seed(20220301)
# predict flipper length with model coefficients
flipper_length_predicted <- predict(
object = flipper_length_lm,
newdata = penguins_synthetic
)
# synthetic column using normal distribution centered on predictions with sd of residual standard error
flipper_length_synthetic <- rnorm(
n = nrow(penguins_synthetic),
mean = flipper_length_predicted,
sd = sigma(flipper_length_lm)
)
# add predictions to synthetic data as flipper_length
penguins_synthetic <- penguins_synthetic %>%
mutate(flipper_length_mm = flipper_length_synthetic)
With all three synthesized variables, we can compare the the
univariate distributions of the confidential data and the synthetic data
(we have already done this for bill_length_mm
).
# create dataframe with both confidential and synthesized values
compare_penguins <- bind_rows(
"confidential" = penguins,
"synthetic" = penguins_synthetic,
.id = "data_source"
)
# Write out final compare_penguins df for use in future exercises
dir.create(here::here("data/"), showWarnings = FALSE)
compare_penguins %>% write_csv(here::here("data", "penguins_synthetic_and_confidential.csv"))
First, we can compare the distributions of the sex
variable in the confidential and synthetic data.
sex_comparison <- compare_penguins %>%
select(data_source, sex) %>%
count(data_source, sex) %>%
group_by(data_source) %>%
mutate(relative_frequency = n / sum(n))
sex_comparison %>%
ggplot(aes(x = n, y = sex, fill = data_source)) +
geom_text(aes(label = n),
position = position_dodge(width = 0.5),
hjust = -0.4) +
geom_col(position = "dodge", alpha = 0.7) +
scale_x_continuous(expand = expansion(add = c(0, 15))) +
labs(y = "", x = "N", title = "Comparison of sex distribution")
# plot comparison of distributions
compare_penguins %>%
select(
data_source,
bill_length_mm,
flipper_length_mm
) %>%
pivot_longer(-data_source, names_to = "variable") %>%
ggplot(aes(x = value, fill = data_source)) +
geom_density(alpha = 0.3) +
facet_wrap(~variable, scales = "free") +
labs(title = "Comparison of Univariate Distributions",
subtitle = "Final synthetic product") +
scatter_grid()
sex
synthesissex
variable that was the same
length as the confidential data. Was this necessary for the method to be
applied correctly?Was the method we used parametric or nonparametric? Why?
What do we notice about the synthetic variable compared to the original?
We generated a new sex
variable that was the same
length as the confidential data. Was this necessary for the method to be
applied correctly?
bill_length_mm
synthesisWas the method we just used parametric or nonparametric? Why?
What do we notice about the synthetic variable compared to the original?
Are we synthesizing these data sequentially? How do you know?
Was the method we just used parametric or nonparametric? Why?
What do we notice about the synthetic variable compared to the original?
Are we synthesizing these data sequentially? How do you know?
sex
variable
was used as a predictor.flipper_length_mm
synthesisWhat do we notice about the synthetic variable compared to the original?
What are benefits and drawbacks to generating this variable sequentially?
Question 4
For this exercise, we will use the starwars
dataset from
the dplyr
package. We will practice sequentially
synthesizing a binary variable (gender
) and a numeric
variable (height
).
# run this to get the dataset we will work with
starwars <- dplyr::starwars %>%
select(gender, height) %>%
drop_na()
starwars %>%
head() %>%
create_table()
gender | height |
---|---|
masculine | 172 |
masculine | 167 |
masculine | 96 |
masculine | 202 |
feminine | 150 |
masculine | 178 |
Question 1: Gender synthesis
Fill in the blanks in the following code to synthesize the
gender
variable using the underlying distribution present
in the data.
# set a seed so pseudo-random processes are reproducible
set.seed(20220301)
# Fill in the blanks!
# vector of gender categories
gender_categories <- c('feminine', 'masculine')
# size of sample to generate
synthetic_data_size <- nrow(starwars)
# probability weights
gender_probs <- starwars %>%
count(gender) %>%
mutate(relative_frequency = ### ______) %>%
pull(relative_frequency)
# use sample function to generate synthetic vector of genders
gender_synthetic <- sample(
x = ###_____,
size = ###_____,
replace = ###_____,
prob = ###_____
)
# create starwars_synthetic dataset using generated variable
starwars_synthetic <- tibble(
gender = gender_synthetic
)
Question 2: Height synthesis
Similarly, fill in the blanks in the code to generate the
height
variable using a linear regression with
gender
as a predictor.
# set a seed so pseudo-random processes are reproducible
set.seed(20220301)
# Fill in the blanks!
# linear regression
height_lm <- lm(
formula = ###_____,
data = ###______
)
# predict flipper length with model coefficients
height_predicted <- predict(
object = height_lm,
newdata = ###_____
)
# synthetic column using normal distribution centered on predictions with sd of residual standard error
height_synthetic <- rnorm(
n = ###_______,
mean = ###______,
sd = ###______
)
# add new values to synthetic data as height
starwars_synthetic <- starwars_synthetic %>%
mutate(height = height_synthetic)