Fill in the grey and green blanks in the following diagram with the terms you learned above. There is also one mistake; see if you can spot it.
Let’s say a researcher generates a synthetic version of a dataset on penguin species. The first 5 rows of the gold standard dataset look like this:
species | bill_length_mm | sex |
---|---|---|
Chinstrap | 51.3 | male |
Gentoo | 44.0 | female |
Chinstrap | 51.4 | male |
Chinstrap | 45.4 | female |
Adelie | 36.2 | female |
One of the metrics used to assess data utility was the overall count of each penguin species in the synthetic and gold standard data, which looks like this (a code sketch of this comparison follows the table):
species | # confidential data | # synthetic data |
---|---|---|
Adelie | 152 | 138 |
Chinstrap | 68 | 68 |
Gentoo | 124 | 116 |
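Here is a minimal sketch of how such a counts comparison might be computed. It assumes Python with pandas, and `conf_df` / `synth_df` are hypothetical stand-ins built to reproduce the counts above, not the real datasets:

```python
import pandas as pd

# Hypothetical stand-ins for the confidential (gold standard) and synthetic data;
# only the species column matters for this comparison.
conf_df = pd.DataFrame({"species": ["Adelie"] * 152 + ["Chinstrap"] * 68 + ["Gentoo"] * 124})
synth_df = pd.DataFrame({"species": ["Adelie"] * 138 + ["Chinstrap"] * 68 + ["Gentoo"] * 116})

# Compare the overall species counts between the two datasets
counts = pd.DataFrame({
    "# confidential data": conf_df["species"].value_counts(),
    "# synthetic data": synth_df["species"].value_counts(),
}).sort_index()
print(counts)
```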
Question 1: Would the above counts be considered a global utility metric or a specific utility metric, and why?
Question 2: Researchers looked at telephone metadata, which included the times, durations, and outgoing numbers of telephone calls. They found that one of the participants in the data placed frequent calls to a local firearm dealer that prominently advertises a specialty in the AR semiautomatic rifle platform. The participant also placed lengthy calls to the customer support hotline for a major firearm manufacturer which produces a popular AR line of rifles. Using publicly available data, the researchers were able to confirm the participant’s identity and that he owned an AR-15. In this example, what kinds of disclosures happened? (Hint: there were two!)
What are the privacy implications of releasing multiple versions (implicates) of a synthetic dataset? Do these implications change for partially vs. fully synthetic data?
Releasing multiple implicates improves transparency and analytical value, but it also increases disclosure risk, since each additional implicate gives an intruder more information about the underlying confidential data (protection cannot rely on “security through obscurity”).
It is riskier to release partially synthetic implicates, since the non-synthesized values are identical across implicates and there remains a one-to-one relationship between confidential and synthesized records.
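As a rough illustration of what releasing multiple implicates means mechanically, the sketch below (Python/pandas; the `synthesize` helper is a hypothetical toy that resamples bill lengths within species, standing in for a real synthesis model) draws several partially synthetic implicates from the same confidential records. Note how the non-synthesized columns repeat identically across implicates:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)

# Hypothetical confidential records (mirrors the 5-row example above)
conf_df = pd.DataFrame({
    "species": ["Chinstrap", "Gentoo", "Chinstrap", "Chinstrap", "Adelie"],
    "bill_length_mm": [51.3, 44.0, 51.4, 45.4, 36.2],
    "sex": ["male", "female", "male", "female", "female"],
})

def synthesize(df: pd.DataFrame) -> pd.DataFrame:
    """Toy stand-in for a real synthesis model: redraw bill lengths by
    resampling within species, leaving species and sex untouched
    (i.e., a crude partially synthetic dataset)."""
    out = df.copy()
    out["bill_length_mm"] = (
        df.groupby("species")["bill_length_mm"]
        .transform(lambda s: rng.choice(s, size=len(s)))
    )
    return out

# Releasing m implicates means publishing m independent draws from the same
# synthesizer; because the non-synthesized values repeat across implicates,
# an intruder can line the datasets up record by record.
implicates = [synthesize(conf_df) for _ in range(3)]
```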
What are the trade-offs of a partially synthetic dataset compared to a fully synthetic dataset?
Synthesizing only some variables (partial synthesis) generally leads to higher utility in analysis, since the relationships among the unsynthesized variables are by definition unchanged (Drechsler et al., 2008).
Disclosure in fully synthetic data is nearly impossible because all values are imputed, while partial synthesis carries higher disclosure risk since confidential values remain in the dataset (Drechsler et al., 2008).
Accurately and exhaustively specifying variable relationships and constraints in fully synthetic data is difficult, and if done incorrectly it can lead to bias.
Partially synthetic data may be publicly perceived as more reliable than fully synthetic data.
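To make the partial vs. full distinction concrete, here is a toy sketch (Python/pandas; the 5-row frame and the independent-column “model” are illustrative assumptions, not a recommended synthesizer):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2)

# Hypothetical confidential records (same 5 rows as the example above)
conf_df = pd.DataFrame({
    "species": ["Chinstrap", "Gentoo", "Chinstrap", "Chinstrap", "Adelie"],
    "bill_length_mm": [51.3, 44.0, 51.4, 45.4, 36.2],
    "sex": ["male", "female", "male", "female", "female"],
})

# Partially synthetic: only the sensitive variable is replaced; species and
# sex keep their true confidential values, so those values remain exposed
# but the rest of each record is untouched.
partial = conf_df.copy()
partial["bill_length_mm"] = rng.choice(conf_df["bill_length_mm"], size=len(conf_df))

# Fully synthetic: every value is redrawn from a model of the data. Here the
# "model" naively treats the columns as independent, which is exactly the kind
# of mis-specified relationship that can introduce bias in full synthesis.
full = pd.DataFrame({
    col: rng.choice(conf_df[col], size=len(conf_df)) for col in conf_df.columns
})
```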
Imagine you are in charge of safeguarding a dataset against an intruder. Brainstorm and discuss features of the intruder that you would consider a “worst-case scenario” in terms of privacy (short of the intruder having access to the entire confidential dataset).