Exercise 1

Question

Fill in the grey and green blanks in the following diagram with the terms you learned above. The diagram also contains one mistake; see if you can spot it.



Solution



Exercise 2

Let’s say a researcher generates a synthetic version of a dataset on penguin species. The first 5 rows of the gold standard dataset look like this:

species    bill_length_mm  sex
Chinstrap  51.3            male
Gentoo     44.0            female
Chinstrap  51.4            male
Chinstrap  45.4            female
Adelie     36.2            female


One of the metrics used to assess data utility was the overall count of each penguin species in the synthetic and gold standard data. The counts look like this:

species    Count (confidential data)  Count (synthetic data)
Adelie     152                        138
Chinstrap  68                         68
Gentoo     124                        116
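
The comparison in the table can also be computed directly. The following is a minimal sketch in Python that uses the counts from the table above; the function name and the choice of absolute and relative differences as summaries are illustrative assumptions, not necessarily what the researcher did.

```python
# Species counts copied from the table above.
confidential = {"Adelie": 152, "Chinstrap": 68, "Gentoo": 124}
synthetic = {"Adelie": 138, "Chinstrap": 68, "Gentoo": 116}


def count_differences(conf, synth):
    """Return {species: (absolute difference, relative difference)}.

    Differences are synthetic minus confidential; the relative
    difference is expressed as a fraction of the confidential count.
    """
    result = {}
    for species, n_conf in conf.items():
        diff = synth.get(species, 0) - n_conf
        result[species] = (diff, diff / n_conf)
    return result


differences = count_differences(confidential, synthetic)
for species, (diff, rel) in differences.items():
    print(f"{species:<10} diff={diff:+d} relative={rel:+.1%}")
```

A difference of zero means the synthetic data reproduce that species count exactly; larger differences indicate lower utility for this particular count.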


Question 1: Would the above counts be considered a global utility metric or a specific utility metric and why?

Question 2: Researchers looked at telephone metadata, which included the times, durations, and outgoing numbers of telephone calls. They found that one participant in the data placed frequent calls to a local firearm dealer that prominently advertises a specialty in the AR semiautomatic rifle platform. The participant also placed lengthy calls to the customer support hotline for a major firearm manufacturer that produces a popular AR line of rifles. Using publicly available data, the researchers were able to confirm the participant’s identity and confirm that he owned an AR-15. In this example, what kinds of disclosure happened? (Hint: there were two!)



Exercise 3

Implicates

Question

What are the privacy implications for releasing multiple versions of a synthetic dataset (implicates)? Do these implications change for partially vs. fully synthetic data?



Question Notes

  • Releasing multiple implicates improves transparency and analytical value but increases disclosure risk, because each additional release gives an intruder more information (it erodes “security through obscurity”).

  • It is riskier to release partially synthetic implicates, since non-synthesized values are identical in every implicate and there remains a 1-to-1 relationship between confidential and synthesized records.
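
The 1-to-1 relationship can be illustrated with a toy example. The sketch below is written in Python under loud assumptions: the three records, the helper names, and the "synthesis" step (a uniform random draw standing in for a real synthesis model) are all hypothetical. It only shows that unsynthesized values carry over unchanged into every implicate and can be used to link rows back to the confidential data.

```python
import random

# Hypothetical confidential records: (species, sex, bill_length_mm).
confidential = [
    ("Chinstrap", "male", 51.3),
    ("Gentoo", "female", 44.0),
    ("Adelie", "female", 36.2),
]


def make_partial_implicate(records, rng):
    """Toy partial synthesis: species and sex are kept as-is, and only
    bill length is replaced with a random draw (a stand-in for a real
    synthesis model)."""
    return [
        (species, sex, round(rng.uniform(30.0, 60.0), 1))
        for species, sex, _ in records
    ]


def link_rows(implicate, records):
    """For each implicate row, list the indices of confidential rows
    that share the same unsynthesized values (species, sex)."""
    return [
        [i for i, (sp, sex, _) in enumerate(records) if (sp, sex) == (s_sp, s_sex)]
        for s_sp, s_sex, _ in implicate
    ]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
implicates = [make_partial_implicate(confidential, rng) for _ in range(3)]
links = [link_rows(implicate, confidential) for implicate in implicates]
print(links)
```

In this toy data every (species, sex) combination is unique, so each synthetic row in each implicate matches exactly one confidential row. An intruder who knows the unsynthesized values can therefore line up the same record across all released implicates.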



Partial vs. fully synthetic

Question

What are the trade-offs of a partially synthetic dataset compared to a fully synthetic dataset?



Question Notes

  • Changing only some variables (partial synthesis) generally leads to higher utility in analysis, since the relationships among the unsynthesized variables are by definition unchanged (Drechsler et al., 2008).

  • Disclosure in fully synthetic data is nearly impossible because all values are imputed, while partial synthesis has higher disclosure risk since confidential values remain in the dataset (Drechsler et al., 2008).

    • Note that while the risk of disclosure for fully synthetic data is very low, it is not zero.
  • Accurately and exhaustively specifying variable relationships and constraints in fully synthetic data is difficult and, if done incorrectly, can lead to bias.

    • If a variable is synthesized incorrectly early in a sequential synthesis, all variables synthesized on the basis of that variable will be affected.
  • Partially synthetic data may be publicly perceived as more reliable than fully synthetic data.
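
The point about errors propagating through a sequential synthesis can be made concrete with a small calculation. The sketch below is in Python and rests on stated assumptions: the per-species mean bill lengths are made-up illustrative values, the "misspecified" species model is hypothetical, and only the species counts (152, 68, 124) come from Exercise 2. It compares the expected bill length when species is synthesized correctly first versus incorrectly first.

```python
# Hypothetical per-species mean bill lengths (mm); illustrative values only.
mean_bill_by_species = {"Adelie": 39.0, "Chinstrap": 49.0, "Gentoo": 47.5}


def expected_bill_length(species_probs, means):
    """Expected bill length when species is synthesized first and bill
    length is then drawn conditional on the synthesized species."""
    return sum(prob * means[species] for species, prob in species_probs.items())


# Step 1 done well: species proportions match the confidential counts
# from Exercise 2 (152, 68, and 124 out of 344).
good_probs = {"Adelie": 152 / 344, "Chinstrap": 68 / 344, "Gentoo": 124 / 344}

# Step 1 done badly (hypothetical): the species model treats all three
# species as equally likely.
bad_probs = {"Adelie": 1 / 3, "Chinstrap": 1 / 3, "Gentoo": 1 / 3}

good_mean = expected_bill_length(good_probs, mean_bill_by_species)
bad_mean = expected_bill_length(bad_probs, mean_bill_by_species)
print(f"correct species model:      {good_mean:.2f} mm")
print(f"misspecified species model: {bad_mean:.2f} mm")
```

Even though the conditional model for bill length is identical in both cases, the error in the first synthesized variable shifts the distribution of the variable synthesized after it.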



Exercise 4

Question

Imagine you are in charge of safeguarding a dataset against an intruder. Brainstorm and discuss features of the intruder that you would consider a “worst-case scenario” in terms of privacy (short of the intruder having access to the entire confidential dataset).



Hints

  • How much computational power might they have?
  • Might they have access to other information about the observations?
  • Might they have access to other, supplemental datasets?