Utility and Disclosure Risk Metrics and Synthetic Data Case Studies

Published

July 11, 2023


Review

What’s the difference between partially synthetic data and fully synthetic data?


Partially synthetic data contains unaltered and synthesized variables. In partially synthetic data, there remains a one-to-one mapping between confidential records and synthetic records.

Fully synthetic data only contains synthesized variables. Fully synthetic data no longer directly map onto the confidential records, but remain statistically representative.

Sequential synthesis

In a perfect world, we would synthesize data by directly modeling the joint distribution of the variables of interest. Unfortunately, this is often computationally infeasible.

Instead, we often decompose a joint distribution into a marginal distribution and a sequence of conditional distributions.
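For example, with three variables, the chain rule lets us write the joint distribution as a marginal and a sequence of conditionals, which we can estimate one model at a time:

\[f(x_1, x_2, x_3) = f(x_1) \cdot f(x_2 | x_1) \cdot f(x_3 | x_1, x_2)\]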

What’s the difference between specific utility and general utility?


Specific Utility measures the similarity of results for a specific analysis (or analyses) of the confidential and public data (e.g., comparing the coefficients in regression models).

General Utility measures the univariate and multivariate distributional similarity between the confidential data and the public data (e.g., sample means, sample variances, and the variance-covariance matrix).





General Utility Metrics

  • As a refresher, general utility metrics measure the distributional similarity (i.e., all statistical properties) between the original and synthetic data.

  • General utility metrics are useful because they provide a sense of how “fit for use” synthetic data is for analysis without making assumptions about the uses of the synthetic data.

Univariate

  • Categorical variables: frequencies, relative frequencies

  • Numeric variables: means, standard deviations, skewness, kurtosis (i.e., the first four moments), percentiles, and the number of zero/non-zero values

  • It is also useful to visually compare univariate distributions using histograms (Figure 1), density plots (Figure 2), and empirical cumulative distribution function plots (Figure 3).
Code
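# compare_penguins and scatter_grid() are assumed to be defined earlier in the
# course materials: compare_penguins stacks the confidential and synthetic
# penguins data with a data_source column (setup not shown in this section)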
compare_penguins |>
  select(
    data_source, 
    bill_length_mm, 
    flipper_length_mm
  ) |>
  pivot_longer(-data_source, names_to = "variable") |>
  ggplot(aes(x = value, fill = data_source)) +
  geom_histogram(alpha = 0.3, color = NA, position = "identity") +
  facet_wrap(~ variable, scales = "free") +
  scatter_grid()
Figure 1: Compare Synthetic and Confidential Distributions with Histograms

Code
compare_penguins |>
  select(
    data_source, 
    bill_length_mm, 
    flipper_length_mm
  ) |>
  pivot_longer(-data_source, names_to = "variable") |>
  ggplot(aes(x = value, fill = data_source)) +
  geom_density(alpha = 0.3, color = NA) +
  facet_wrap(~variable, scales = "free") +
  scatter_grid()
Figure 2: Compare Synthetic and Confidential Distributions with Density Plots

Code
compare_penguins |>
  select(
    data_source, 
    bill_length_mm, 
    flipper_length_mm
  ) |>
  pivot_longer(-data_source, names_to = "variable") |>
  ggplot(aes(x = value, color = data_source)) +
  stat_ecdf() +
  facet_wrap(~ variable, scales = "free") +
  scatter_grid()
Figure 3: Compare Synthetic and Confidential Distributions with Empirical CDF Plots

Bivariate

Correlation Fit

Correlation fit measures how well the synthesizer recreates the linear relationships between variables in the confidential dataset.

  • Create correlation matrices for the synthetic data and confidential data. Then measure differences across synthetic and actual data. Those differences are often summarized across all variables using L1 or L2 distance.
Figure 4: Correlation Difference

  • Figure 4 shows the creation of a difference matrix. Let’s summarize the difference matrix using mean absolute error. This gives us a sense of how far off, on average, the correlations in the synthetic data are from the correlations in the confidential data.

\[MAE_{dist} = \frac{1}{n}\sum_{i = 1}^n |dist_i|\]

\[MAE_{dist} = \frac{1}{6} \left(|-0.15| + |0.01| + |0.1| + |-0.15| + |0.15| + |0.02|\right) \approx 0.0966667\]
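A minimal sketch of this calculation in R, assuming hypothetical data frames confidential and synthetic that contain the same numeric variables:

Code
# correlation matrices for the confidential and synthetic data
# (`confidential` and `synthetic` are hypothetical data frames with the
# same numeric columns)
cor_conf <- cor(confidential)
cor_synth <- cor(synthetic)

# difference matrix and its unique (lower-triangle) entries
cor_diff <- cor_synth - cor_conf
unique_diffs <- cor_diff[lower.tri(cor_diff)]

# L1 summary: mean absolute error of the correlation differences
mean(abs(unique_diffs))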

  • Advanced measures like relative mutual information can be used to measure the relationships between categorical variables.

Multivariate

Discriminant Based Methods

Discriminant based methods measure how well a predictive model can distinguish (i.e., discriminate) between records from the confidential and synthetic data.

  • The confidential data and synthetic data should theoretically be drawn from the same super population.

  • The basic idea is to combine (stack) the confidential data and synthetic data and see how well a predictive model can distinguish (i.e., discriminate) between synthetic observations and confidential observations.

  • An inability to distinguish between the records suggests a good synthesis.

  • It is possible to use logistic regression for the predictive modeling, but decision trees, random forests, and boosted trees are more common.

  • Figure 5 shows three discriminant based metrics calculated on a good synthesis and a poor synthesis.

Figure 5: A comparison of discriminant metrics on a good synthesis and a poor synthesis


Calculating Discriminant Metrics

  • pMSE ratio, SPECKS, and AUC all require calculating propensity scores (i.e., the probability that a particular record belongs to the synthetic data) and start with the same steps.
  1. Combine the synthetic and confidential data. Add an indicator variable with 0 for the confidential data and 1 for the synthetic data.
species bill_length_mm sex ind
Chinstrap 49.5 male 0
... ... ... ...
Adelie 46.0 male 1


  2. Calculate propensity scores (i.e., probabilities for group membership) for whether a given row belongs to the synthetic dataset.
species bill_length_mm sex ind prop_score
Chinstrap 49.5 male 0 0.32
... ... ... ... ...
Adelie 46.0 male 1 0.64


  • pMSE: Calculates the average Mean Squared Error (MSE) between the propensity scores and the expected probabilities:

  • Proposed by Woo et al. (2009) and enhanced by Snoke et al. (2018a).

  • After doing steps 1) and 2) above:

    3. Calculate the expected probability, i.e., the share of synthetic data in the combined data. When the synthetic and confidential datasets are the same size, this will always be 0.5.

      species bill_length_mm sex ind prop_score exp_prob
      Chinstrap 49.5 male 0 0.32 0.5
      ... ... ... ... ... ...
      Adelie 46.0 male 1 0.64 0.5


    4. Calculate pMSE, which is the mean squared difference between the propensity scores and the expected probabilities.

    \[pMSE = \frac{(0.32 - 0.5)^2 + ... + (0.64-0.5)^2}{N} \]

  • Often people use the pMSE ratio, which is the pMSE divided by its expected value under a null model (Snoke et al. 2018b).

  • The null model is the expected value of the pMSE score in the best case scenario, when the model used to generate the data reflects the confidential data perfectly.

  • pMSE ratio = 1 means that your synthetic data and confidential data are indistinguishable, although values this low are almost never achieved.
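Below is a minimal pMSE sketch in R using a logistic regression discriminator. The data frames confidential and synthetic are hypothetical stand-ins with identical columns, and the null pMSE is approximated here by permuting the indicator, which is only one of several possible approaches.

Code
library(dplyr)

# step 1: stack the data and add the indicator (0 = confidential, 1 = synthetic)
combined <- bind_rows(
  confidential |> mutate(ind = 0),
  synthetic |> mutate(ind = 1)
)

# step 2: propensity scores from a logistic regression discriminator
prop_model <- glm(ind ~ ., data = combined, family = binomial())
prop_scores <- predict(prop_model, type = "response")

# step 3: expected probability = share of synthetic rows in the stack
exp_prob <- mean(combined$ind)

# step 4: pMSE = mean squared deviation from the expected probability
pmse <- mean((prop_scores - exp_prob)^2)

# approximate the null pMSE by permuting the indicator and refitting
null_pmse <- replicate(20, {
  shuffled <- combined |> mutate(ind = sample(ind))
  null_fit <- glm(ind ~ ., data = shuffled, family = binomial())
  mean((predict(null_fit, type = "response") - exp_prob)^2)
})

pmse_ratio <- pmse / mean(null_pmse)
pmse_ratio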



  • SPECKS: Synthetic data generation; Propensity score matching; Empirical Comparison via the Kolmogorov-Smirnov distance.

After generating propensity scores (i.e., steps 1 and 2 from above), you:

  1. Calculate the empirical CDF’s of the propensity scores for the synthetic and confidential data, separately.

  2. Calculate the Kolmogorov-Smirnov (KS) distance between the two empirical CDFs. The KS distance is the maximum vertical distance between the two empirical CDFs.
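A minimal SPECKS sketch in R, reusing the hypothetical combined data and prop_scores from the pMSE sketch above:

Code
# split the propensity scores from the hypothetical pMSE sketch by group
prop_conf <- prop_scores[combined$ind == 0]
prop_synth <- prop_scores[combined$ind == 1]

# SPECKS: the KS distance between the two empirical CDFs
ks.test(prop_conf, prop_synth)$statistic

# equivalently, the maximum vertical gap between the two empirical CDFs
ecdf_conf <- ecdf(prop_conf)
ecdf_synth <- ecdf(prop_synth)
grid <- sort(c(prop_conf, prop_synth))
max(abs(ecdf_conf(grid) - ecdf_synth(grid)))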





  • Receiver Operating Characteristic (ROC) curves show the trade off between false positives and true positives. Area under the curve (AUC) is a single number summary of the ROC curve.

AUC is a common tool for evaluating classification models. High values for AUC are bad because they suggest the model can distinguish between confidential and synthetic observations.

After generating propensity scores (i.e., steps 1 and 2 from above), calculate the AUC of the propensity scores against the synthetic/confidential indicator (a minimal sketch follows the list below).

  • In our context, High AUC = good at discriminating = poor synthesis.

  • In the best case, AUC = 0.5, because that means the discriminator is no better than a random guess.

  • Look at Figure 5 to see calculations for pMSE ratio, SPECKS, and AUC.
  • It is useful to look at variable importance for predictive models when observing poor discriminant based metrics. Variable importance can help diagnose which variables are poorly synthesized.
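A minimal AUC sketch using the rank-based (Mann-Whitney) formula, again reusing the hypothetical prop_scores and indicator from the pMSE sketch:

Code
# AUC via the Mann-Whitney rank formula; prop_scores and combined come from
# the hypothetical pMSE sketch above
ind <- combined$ind
n_synth <- sum(ind == 1)
n_conf <- sum(ind == 0)
ranks <- rank(prop_scores)

auc <- (sum(ranks[ind == 1]) - n_synth * (n_synth + 1) / 2) /
  (n_synth * n_conf)
auc  # 0.5 means the discriminator is no better than random guessing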





Exercise 1: Using Utility Metrics

Consider the following two syntheses of x. Which synthesis do you think is better?

Code
set.seed(20230710)
bind_rows(
  synth1 = tibble(
    x_conf = rnorm(n = 1000),
    x_synth = rnorm(n = 1000, mean = 0.2)
  ),
  synth2 = tibble(
    x_conf = rnorm(n = 1000),
    x_synth = rnorm(n = 1000, sd = 0.5)
  ),
  .id = "synthesis"
) |>
  pivot_longer(-synthesis, names_to = "variable") |>
  ggplot(aes(x = value, color = variable)) +
  stat_ecdf() +
  facet_wrap(~ synthesis) +
  scatter_grid()

Both syntheses have issues. What do you think the issues are?


  • We consider synth1 to be slightly better than synth2 based on the large vertical distances between the lines for synth2.
  • synth1 looks to match the variance of the confidential data but the mean is a little too high. synth2 matches the mean, but it contains far too little variance. There aren’t enough observations in the tails of the synthetic data.

Exercise 2: Correlation Difference

Consider the following correlation matrices:

[1] "Synthetic"
     [,1] [,2] [,3]
[1,] 1.00  0.5 0.75
[2,] 0.50  1.0 0.80
[3,] 0.75  0.8 1.00
[1] "Confidential"
     [,1] [,2] [,3]
[1,] 1.00 0.35  0.1
[2,] 0.35 1.00  0.9
[3,] 0.10 0.90  1.0
  • Construct the difference matrix
  • Calculate MAE
  • Optional: Calculate RMSE
  • Optional: What is the main difference between MAE and RMSE?
  • Construct the difference matrix
Code
# define the matrices shown above
mat_synth <- matrix(c(1, 0.5, 0.75, 0.5, 1, 0.8, 0.75, 0.8, 1), nrow = 3)
mat_conf <- matrix(c(1, 0.35, 0.1, 0.35, 1, 0.9, 0.1, 0.9, 1), nrow = 3)

diff <- mat_synth - mat_conf

# keep only the unique correlations in the lower triangle
diff[!lower.tri(diff)] <- NA

diff
     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,] 0.15   NA   NA
[3,] 0.65 -0.1   NA
  • Calculate MAE
Code
mean(abs(diff[lower.tri(diff)]))
[1] 0.3
  • Optional: Calculate RMSE
Code
sqrt(mean(diff[lower.tri(diff)] ^ 2))
[1] 0.389444
  • Optional: What is the main difference between MAE and RMSE?

RMSE gives extra weight to large errors because it squares values instead of using absolute values. We like to think of this as the difference between the mean and the median error.

Specific Utility Metrics

  • Specific utility metrics measure how suitable a synthetic dataset is for specific analyses.

  • These specific utility metrics will change from application to application, depending on common uses of the data.

  • A helpful rule of thumb: general utility metrics are useful for the data synthesizers to be convinced that they’re doing a good job. Specific utility metrics are useful to convince downstream data users that the data synthesizers are doing a good job.

Recreating Inferences

  • It can be useful to compare statistical analyses on the confidential data and synthetic data:
    • Do the estimates have the same sign?
    • Do the estimates have the same statistical inference at a common \(\alpha\) level?
    • Do the confidence intervals for the estimates overlap?
  • Each of these questions is useful. Barrientos et al. (2021) combine all three questions into sign, significance, and overlap (SSO) match. SSO is the proportion of times that intervals overlap and have the same sign and significance.
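A minimal sketch of an SSO check for a single estimate, assuming hypothetical point estimates est_conf and est_synth and 95% confidence intervals ci_conf and ci_synth stored as c(lower, upper):

Code
# sign: do the point estimates agree in sign?
same_sign <- sign(est_conf) == sign(est_synth)

# significance: do both intervals reach the same conclusion about H0: 0?
same_signif <- (ci_conf[1] > 0 | ci_conf[2] < 0) ==
  (ci_synth[1] > 0 | ci_synth[2] < 0)

# overlap: neither interval sits entirely above or below the other
overlap <- ci_conf[1] <= ci_synth[2] & ci_synth[1] <= ci_conf[2]

# SSO match requires all three
same_sign & same_signif & overlap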

Regression Confidence Interval Overlap

Regression confidence interval overlap quantifies how well confidence intervals from estimates on the synthetic data recreate confidence intervals from the confidential data.

1 indicates perfect overlap. 0 indicates intervals that are adjacent but not overlapping. Negative values indicate gaps between the intervals.

It is common to compare intervals from linear regression models and logistic regression models.

  • The interpretability of confidence interval overlap diminishes when disclosure control methods generate very wide confidence intervals.
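One common formulation averages the share of each interval covered by their intersection. A minimal sketch, assuming hypothetical c(lower, upper) intervals from the confidential and synthetic fits:

Code
# average share of each interval covered by the intersection; the value
# can be negative when the intervals do not overlap
ci_overlap <- function(ci_conf, ci_synth) {
  lower <- max(ci_conf[1], ci_synth[1])
  upper <- min(ci_conf[2], ci_synth[2])

  0.5 * (
    (upper - lower) / (ci_conf[2] - ci_conf[1]) +
      (upper - lower) / (ci_synth[2] - ci_synth[1])
  )
}

# example with hypothetical intervals
ci_overlap(ci_conf = c(2.3, 3.3), ci_synth = c(1.5, 2.7))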


Microsimulation results

  • The Urban Institute and Tax Policy Center are heavy users of microsimulation.

  • When synthesizing administrative tax data, we compare microsimulation results from tax calculators applied to the confidential data and synthetic data. Figure 6 shows results from the 2012 Synthetic Supplement PUF.

Figure 6: Microsimulation results from the 2012 Synthetic Supplement PUF

  • Figure 6 compares distributional output from baseline runs. It is also useful to compare tax reforms on the confidential and synthetic data.





Exercise 3: SSO

Suppose we are interested in the following null and alternative hypotheses:

\[H_0: \mu = 0\]

\[H_a: \mu \ne 0\]

Consider the following output:

[1] "Confidential Mean: 2.79266090083311"
[1] "Confidential Confidence Interval"
[1] 2.308338 3.276984
attr(,"conf.level")
[1] 0.95
[1] "Synthetic Mean: 2.08452909904545"
[1] "Synthetic Confidence Interval"
[1] 1.512416 2.656643
attr(,"conf.level")
[1] 0.95

Do the synthetic data achieve SSO match?


Yes! The confidence intervals overlap, the signs are the same, and the statistical significance is the same.

Disclosure Risk Metrics

We now pivot to evaluating the disclosure risks of synthetic data.

Identity Disclosure Metrics


Identity disclosure metrics evaluate how often we correctly re-identify confidential records in the synthetic data.

Note: These metrics require major assumptions about attacker information.

  • For fully synthetic datasets, there is no one-to-one relationship between individuals and records, so identity disclosure risk is ill-defined. Generally, identity disclosure risk applies to partially synthetic datasets (or datasets protected with traditional SDC methods).

  • Most of these metrics rely on data maintainers essentially performing attacks against their synthetic data and seeing how successful they are at identifying individuals.

Basic matching approaches

  • We start by making assumptions about the knowledge an attacker has (i.e., the external, publicly accessible data available to them).

  • For each confidential record, the data attacker identifies a set of partially synthetic records which they believe contain the target record (i.e., potential matches) using the external variables as matching criteria.

  • There are distance-based and probability-based algorithms that can perform this matching. This matching process could be based on exact matches between variables or some relaxations (i.e., matching continuous variables within a certain radius of the target record, or matching adjacent categorical variables).

  • We then evaluate how accurate our re-identification process was using a variety of metrics.
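A minimal sketch of an exact-matching attack in R, assuming hypothetical data frames external (the attacker’s data, including names) and synthetic (the released file), matched on key variables the attacker is assumed to know (here, homeworld and species, as in the example that follows):

Code
library(dplyr)

# candidate matches: every synthetic row that agrees with an external row
# on the key variables
candidates <- external |>
  inner_join(synthetic, by = c("homeworld", "species"))

# number of candidate records identified for each known individual
candidates |>
  count(name)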

As a simple example for the metrics we’re about to cover, imagine a data attacker has access to the following external data:

homeworld species name
Naboo Gungan Jar Jar Binks
Naboo Droid R2-D2


And imagine that the partially synthetic released data looks like this:

homeworld species skin_color
Tatooine Human fair
Tatooine Droid gold
Naboo Droid white, blue
Tatooine Human white
Alderaan Human light
Tatooine Human light


Note that the released partially synthetic data does not include names. But using some basic matching rules in combination with the external data, an attacker is able to identify the following potential matches for Jar Jar Binks and R2-D2, two characters in the Star Wars universe:

homeworld species skin_color
Potential Jar Jar matches
Naboo Gungan orange
Naboo Gungan grey
Naboo Gungan green
Potential R2-D2 Matches
Naboo Droid white, blue


And since we are the data maintainers, we can take a look at the confidential data and determine which of these potential matches are “true” matches. In this example, one of the three Jar Jar candidates is his true record, and the single R2-D2 candidate is R2-D2’s true record.


These matches above are counted in various ways to evaluate identity disclosure risk. Below are some of those specific metrics. Generally for a good synthesis, we want a low expected match rate and true match rate, and a high false match rate.


  • Expected Match Rate: On average, how likely is it to find a “correct” match among all potential matches? Essentially, the number of observations in the confidential data we expect an intruder to correctly match.

    • Higher expected match rate = higher identification disclosure risk.

    • The two other risk metrics below focus on the subset of confidential records for which the intruder identifies a single match.

    • In our example, this is \(\frac{1}{3} + 1 = 1.333\): Jar Jar contributes \(\frac{1}{3}\) (one true record among three candidates), and R2-D2 contributes 1 (a single candidate that is a correct match).



  • True Match Rate: The proportion of true unique matches among all confidential records. Higher true match rate = higher identification disclosure risk.

  • In our example, R2-D2 is the only unique match, and it is a true match. Assuming there are 100 rows in the confidential data, this is \(\frac{1}{100} = 1\%\).





  • False Match Rate: The proportion of false matches among the set of unique matches. Lower false match rate = higher identification disclosure risk.

  • In our example, this is \(\frac{0}{1} = 0\%\).







Attribute Disclosure risk metrics

  • We were able to learn about Jar Jar and R2-D2 by re-identifying them in the data. It is also possible to learn confidential attributes without perfectly re-identifying observations in the data.

Predictive Accuracy


Predictive accuracy measures how well an attacker can learn about attributes in the confidential data using the synthetic data (and possibly external data).

  • Similar to above, you start by matching synthetic records to confidential records. Alternatively, you can build a predictive model using the synthetic data to make predictions on the confidential data.

  • key variables: Variables that an attacker already knows about a record and can use to match.

  • target variables: Variables that an attacker wishes to know more or infer about using the synthetic data.

  • Pick a sensitive variable in the confidential data and use the synthetic data to make predictions. Evaluate the accuracy of the predictions.
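A minimal sketch of the model-based version in R, with hypothetical synthetic and confidential data frames, hypothetical key variables key1 and key2, and a hypothetical sensitive numeric target sensitive_value:

Code
# fit a model on the synthetic data to predict a sensitive target from
# key variables an attacker might already know (all names hypothetical)
attack_model <- lm(sensitive_value ~ key1 + key2, data = synthetic)

# use the model to infer the sensitive value for confidential records
inferred <- predict(attack_model, newdata = confidential)

# lower prediction error implies higher attribute disclosure risk
sqrt(mean((confidential$sensitive_value - inferred)^2))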



Membership Inference Tests


Membership inference tests explore how well an attacker can determine whether a given observation was in the training data for the synthetic data.

  • Why is this important? Sometimes membership in a synthetic dataset is also confidential (e.g., a dataset of HIV positive patients or people who have experienced homelessness).

  • Also particularly useful for fully synthetic data where identity disclosure and attribute disclosure metrics don’t really make a lot of sense.

  • Assumes that the attacker has access to a subset of the confidential data and wants to tell whether one or more of those records were used to generate the synthetic data.

  • Since we as data maintainers know the true answers, we can evaluate whether the attacker’s guess is correct and break the results down many ways (e.g., true positives, true negatives, false positives, and false negatives).

Figure source: Mendelevitch and Lesh (2021)

  • The “close enough” threshold is usually determined by a custom distance metric, like edit distance between text variables or numeric distance between continuous variables.

  • Often you will want to choose different distance thresholds and evaluate how your results change.
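A minimal sketch of a distance-based membership test, assuming hypothetical scaled numeric matrices attacker_records and synth_mat with the same columns, and a hypothetical distance threshold:

Code
# flag a record as "used in training" if any synthetic record falls within
# a chosen distance threshold (all objects here are hypothetical)
threshold <- 0.1

min_dist <- apply(attacker_records, 1, function(record) {
  diffs <- sweep(synth_mat, 2, record)   # subtract the record from each row
  min(sqrt(rowSums(diffs^2)))            # closest Euclidean distance
})

guessed_member <- min_dist <= threshold

# since we hold the confidential data, we can compare guessed_member to the
# truth and tabulate true/false positives and negatives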

Copy Protection


Copy protection metrics measure how often the synthesizer memorizes or inadvertently duplicates confidential records.

  • Distance to Closest Record (DCR): Measures the distance between each real record (\(r_i\)) and the closest synthetic record (\(s_j\)), as determined by a distance calculation.

  • Many common distance metrics are used in the literature, including Euclidean distance, cosine distance, Gower distance, and Hamming distance (Mendelevitch and Lesh 2021).

  • The goal of this metric is to expose exact copies or simple perturbations of the real records that exist in the synthetic dataset.

  • Note that DCR = 0 doesn’t necessarily mean a high disclosure risk, because in some datasets the “space” spanned by the variables in scope is relatively small.
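A minimal DCR sketch in R, assuming hypothetical scaled numeric matrices conf_mat and synth_mat with the same columns:

Code
# distance to closest record: for each confidential record, the Euclidean
# distance to its nearest synthetic neighbor (conf_mat and synth_mat are
# hypothetical scaled numeric matrices)
dcr <- apply(conf_mat, 1, function(record) {
  diffs <- sweep(synth_mat, 2, record)
  min(sqrt(rowSums(diffs^2)))
})

summary(dcr)

# exact or near copies of confidential records appear as DCR values at or
# near zero
mean(dcr == 0)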



Hold Out Data


Membership inference tests and copy protection metrics are informative but lack context. When possible, create a holdout dataset similar to the training data. Then calculate the membership inference tests and copy protection metrics, replacing the synthetic data with the holdout data. The results are useful for benchmarking the original membership inference tests and copy protection metrics.

Exercise 4: Disclosure Metrics

Figure 7: Attacker information and partially synthetic data

Name Year Elective
Adam 2009 Chorus
Betsy 2010 Band

Attacker Information

Year Elective Synthetic SAT
2008 Chorus 1100
2008 Chorus 1420
2009 Chorus 900
2009 Band 1100
2010 Band 1420
2010 Band 900
2010 Band 1200

Partially Synthetic Data

  • Are there any matches for Adam?
  • Are there any matches for Betsy?
  • What risks are created by the release?


  • Are there any matches for Adam?

Using Year and Elective as key variables, Adam has a unique match (the single 2009/Chorus row).

  • Are there any matches for Betsy?

Using Year and Elective as key variables, Betsy has three matches (the three 2010/Band rows).

  • What risks are created by the release?

It is tough to say without context but here are a few considerations:

  • Is SAT easily observable outside of the data?
  • Are the values of Synthetic SAT close to the true values for SAT?
  • Are SAT and Synthetic SAT likely to be close under random guessing because SAT has low sample variance?





Case Studies

Fully Synthetic PUF for IRS Non-Filers (Bowen et al. 2020)

  • Data: A 2012 file of “non-filers” created by the IRS Statistics of Income Division.
  • Motivation: Non-filer information is important for modeling certain tax reforms and this was a proof-of-concept for a more complex file.
  • Methods: Sequential CART models with additional noise added based on the sparsity of nearby observations in the confidential distribution.
  • Important metrics:
    • General utility: Proportions of non-zero values, first four moments, correlation fit
    • Specific utility: Tax microsimulation, regression confidence interval overlap
    • Disclosure: Node heterogeneity in the CART model, rates of recreating observations
  • Lessons learned:
    • Synthetic data can work well for tax microsimulation.
    • It is difficult to match certain utility metrics for sparse variables.

Fully Synthetic SIPP data (Benedetto et al. 2018)

  • Data: Survey of Income and Program Participation linked to administrative longitudinal earnings and benefits data from IRS and SSA.
  • Motivation: To expand access to detailed economic data that is highly restricted without heavy disclosure control.
  • Methods: Sequential regression multiple imputation (SRMI) with OLS regression, logistic regression, and Bayesian bootstrap. They released four implicates of the synthetic data.
  • Important metrics:
    • General utility: pMSE
    • Specific utility: None
    • Disclosure: Distance based re-identification, RMSE of the closest record to measure attribute disclosure
  • Lessons learned:
    • One of the first major synthetic files in the US.
    • The file includes complex relationships between family members that are synthesized.

Partially Synthetic Geocodes (Drechsler and Hu 2021)

  • Data: Integrated Employment Biographies (German administrative data) with linked geocodes (latitude and longitude)
  • Motivation: Rich geographic information can be used to answer many important labor market research questions. This data would otherwise be too sensitive to release because of the possibility of identifying an individual based on the combination of their location and other attributes.
  • Methods: CART with categorical geocodes. Also evaluated CART with continuous geocodes and a Bayesian latent class model.
  • Important metrics:
    • General utility: Relative frequencies of cross tabulations
    • Specific utility: Zip Code comparisons of tabulated variables, Ripley’s K- and L-functions
    • Disclosure: Probabilities of re-identification (Reiter and Mitra, 2009), comparing the expected match risk and the true match rate
  • Lessons learned:
    • The synthetic data with geocodes had more measured disclosure risk than the original data.
    • Synthesizing more variables made a huge difference in the measured disclosure risks.
    • Adjusting CART hyperparameters was not an effective way to manage the risk-utility tradeoff.
    • They stratified the data before synthesis for computational reasons.




Suggested Reading

Snoke, Joshua, Gillian M Raab, Beata Nowok, Chris Dibben, and Aleksandra Slavkovic. 2018b. “General and Specific Utility Measures for Synthetic Data.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 181 (3): 663–88.

Bowen, Claire McKay, Victoria Bryant, Leonard Burman, Surachai Khitatrakun, Robert McClelland, Philip Stallworth, Kyle Ueyama, and Aaron R Williams. 2020. “A Synthetic Supplemental Public Use File of Low-Income Information Return Data: Methodology, Utility, and Privacy Implications.” In International Conference on Privacy in Statistical Databases, 257–70. Springer.

References

Barrientos, Andrés F., Aaron R. Williams, Joshua Snoke, and Claire McKay Bowen. 2021. “A Feasibility Study of Differentially Private Summary Statistics and Regression Analyses with Evaluations on Administrative and Survey Data.” https://doi.org/10.48550/ARXIV.2110.12055.
Benedetto, Gary, Jordan C Stanley, Evan Totty, et al. 2018. “The Creation and Use of the SIPP Synthetic Beta V7. 0.” US Census Bureau.
Bowen, Claire McKay, Victoria Bryant, Leonard Burman, Surachai Khitatrakun, Robert McClelland, Philip Stallworth, Kyle Ueyama, and Aaron R Williams. 2020. “A Synthetic Supplemental Public Use File of Low-Income Information Return Data: Methodology, Utility, and Privacy Implications.” In International Conference on Privacy in Statistical Databases, 257–70. Springer.
Drechsler, Jörg, and Jingchen Hu. 2021. “Synthesizing Geocodes to Facilitate Access to Detailed Geographical Information in Large-Scale Administrative Data.” Journal of Survey Statistics and Methodology 9 (3): 523–48.
Mendelevitch, Ofer, and Michael D Lesh. 2021. “Fidelity and Privacy of Synthetic Medical Data.” arXiv Preprint arXiv:2101.08658.
Snoke, Joshua, Gillian M. Raab, Beata Nowok, Chris Dibben, and Aleksandra Slavkovic. 2018a. “General and Specific Utility Measures for Synthetic Data.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 181 (3): 663–88. https://doi.org/10.1111/rssa.12358.
Snoke, Joshua, Gillian M Raab, Beata Nowok, Chris Dibben, and Aleksandra Slavkovic. 2018b. “General and Specific Utility Measures for Synthetic Data.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 181 (3): 663–88.
Woo, Mi-Ja, Jerome P Reiter, Anna Oganian, and Alan F Karr. 2009. “Global Measures of Data Utility for Microdata Masked for Disclosure Limitation.” Journal of Privacy and Confidentiality 1 (1).