5  Utility Evaluation

Published

April 23, 2026

Abstract
This chapter introduces evaluating the utility of synthetic data.
Figure 5.1
A person, typing on a laptop with glasses and a cell phone on the side.
Overview

In this chapter, you will learn about:

  • The different types of utility metrics used to evaluate synthetic data.
  • How to apply these metrics through both conceptual questions and hands-on computational exercises.
  • Key considerations and best practices when interpreting and using these metrics in real-world contexts.
Note

We assume throughout these notes that the original dataset, the confidential dataset, and the GSDS are identical. In practice, it is necessary to pick the appropriate comparison dataset for evaluating the synthetic dataset.

5.1 Motivations for Utility Evaluation

Definition: Data Utility, Quality, Accuracy, or Usefulness

Data utility, quality, accuracy, or usefulness is how useful or accurate the data are for research and analysis purposes.

Definition: Utility Metrics for Synthetic Data

Utility metrics are metrics that measure a synthetic dataset’s degree of usefulness for downstream data processing.

Why produce utility metrics?

  1. Assess how synthetic data can or cannot be used.
    • e.g., can I use synthetic data to estimate the effect of degree type on migration in and out of the state?
    • e.g., can I use synthetic data to estimate the number of students with subsidized healthcare coverage?
  2. Motivate moving the needle on the privacy-utility trade-off.
    • e.g., does the relationship between degree type and migration need to be more accurate so synthetic data can be useful to practitioners?
    • e.g., is the estimate of the number of students with subsidized healthcare coverage accurate enough that I could include more noise in that variable?
WARNING: Utility metric motivation

Reason #1 describes what synthetic data can and cannot do as a mathematical description.

Reason #2 describes what synthetic data should or should not do as a policy choice for data curators.

Figure 5.2: Comparison of generalization and privacy-utility trade-off as model complexity increases.
A graphic illustrating the trade-offs in model development, composed of two parts: “Generalization” and “Privacy-Utility Trade-off.” The left chart, titled “Generalization,” plots “Error” on the y-axis against “Model Complexity” on the x-axis. It features two lines: A blue “Training Data” line shows error consistently decreasing as model complexity increases. A yellow “Test Data” line forms a U-shape, showing error decreasing to a minimum before increasing again as the model becomes more complex, which illustrates overfitting. The right side, titled “Privacy-Utility Trade-off,” contains two charts stacked vertically over the same “Model Complexity” x-axis: The bottom chart shows “Error” on its y-axis, with a blue line decreasing as complexity increases. The top chart shows “Disclosure Risk” on its y-axis, with a yellow line that curves upward, indicating risk increases with model complexity. A vertical, dashed pink line labeled “Policy decision on the trade-off” cuts across both right-hand charts, signifying a chosen balance point between model utility (lower error) and privacy (lower disclosure risk).

A main goal of synthesis is to learn the data generation process of the population or superpopulation from an observed dataset while minimizing the amount of information learned about individual observations in that data. In this way, avoiding memorizing information about individual records aids both generalization and disclosure risk protections.

  • The left panel of Figure 5.2 shows how increasing model complexity leads to training data memorization and an inability of models to generalize to new data.
  • The right panel of Figure 5.2 shows how increasing model complexity leads to confidential data memorization and increased disclosure risks.

It is important to measure data utility to understand what synthetic data can and can’t do and to understand where the synthetic data fall in the privacy-utility trade-off. However, it is also important to understand that targeting what synthetic data should do is a policy choice for data curators.

5.2 Statistical and Administrative Uses

Two broad categories of data use:

  • Statistical data use refers to using summaries to infer statistical features of an implicit or explicit population or superpopulation.

  • Administrative data use refers to using summaries to describe features of the data for direct use of the data.

Why does this distinction matter? Many policies and technologies attempt to draw hard lines between these two!

For example…

  1. Confidential Information Protection and Statistical Efficiency Act (CIPSEA): “To ensure that information supplied by individuals or organizations to an agency for statistical purposes under a pledge of confidentiality is used exclusively for statistical purposes.”

  2. Differential Privacy’s risk measures: “Statistical data uses rely on ‘data about you’ and DON’T concern privacy, whereas administrative data uses rely on ‘your data’ and DO concern privacy.”

IMPORTANT: Challenges in distinguishing between statistical and administrative purpose

The same summaries that enable statistical data uses will always enable some administrative data uses, and contextual evaluation is necessary to strike the right balance between the two.

5.3 General Utility Metrics

Definition: General, Global, or Generic Utility

General utility metrics measure differences between synthetic and confidential data independent of pre-specified use cases.

General utility metrics…

  • measure the distributional similarity between the confidential and synthetic data, typically using empirical distributions.
  • are useful because they provide a sense of how “fit for use” synthetic data are for analysis without making assumptions about the uses of the synthetic data.

Example questions:

  • “How similar is the empirical distribution of GPA between confidential and synthetic data?”
  • “How easily could one build a model to distinguish between confidential and synthetic GPA values?”

Example metrics:

  • Distributional distances (e.g., distances between empirical cumulative distribution functions, L-norm distances, Wasserstein distances)
  • Distributional visualizations (e.g., histograms, density estimates, empirical cumulative distribution functions)

5.3.1 Univariate

  • Comparing univariate distributions is a basic approach to evaluating utility. For example, Figure 5.3 shows a comparison of means for tax variables for a synthetic file released by the IRS Statistics of Income division.

  • Categorical variables: frequencies, relative frequencies.

  • Numeric variables: means, standard deviations, skewness, kurtosis (i.e., first four moments), percentiles, and number of zero/non-zero values.

For the univariate case, we could calculate the frequencies and relative frequencies of the categorical variables in the synthetic and confidential data. When the variables are numeric, we could compute the means, standard deviations, skewness, kurtosis, percentiles, and number of zero/non-zero values. We can also visually compare the results of the univariate distributions from the synthetic and confidential data using a histogram, density plots, or empirical cumulative distribution function (eCDF) plots.

The eCDF comparison of the synthetic data and confidential data is useful for identifying differences in both central values and the tails. In other words, discrepancies may arise not only in the central region but also in the tails, where the eCDF approaches 0 or 1.

Woo et al. (2009b) propose the use of the eCDF as a global utility metric, which is particularly useful for univariate numeric variables. While the metric can be easily extended to multivariate distributions, implementing the eCDF estimation itself in multivariate cases is not straightforward. Therefore, the eCDF global utility metric is rarely used beyond univariate cases.
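One way to turn an eCDF comparison into a number is to evaluate both eCDFs at the pooled data values and summarize the differences. The sketch below reports the maximum absolute difference and the mean squared difference. It is only in the spirit of the eCDF comparison described above, and the exact definitions in Woo et al. (2009b) may differ. The simulated vectors `x_conf` and `x_synth` are hypothetical stand-ins for a confidential and a synthetic variable.

```r
set.seed(20260423)

# Hypothetical stand-ins for one numeric variable in each dataset
x_conf <- rnorm(1000)
x_synth <- rnorm(1000, mean = 0.2)

# ecdf() returns a function that evaluates the empirical CDF at any point
ecdf_conf <- ecdf(x_conf)
ecdf_synth <- ecdf(x_synth)

# Evaluate both eCDFs at every observed value in the pooled data
pooled <- c(x_conf, x_synth)
ecdf_diff <- ecdf_conf(pooled) - ecdf_synth(pooled)

# Two summaries: the worst-case gap and the average squared gap
max_abs_diff <- max(abs(ecdf_diff))
mean_sq_diff <- mean(ecdf_diff^2)

c(max_abs_diff = max_abs_diff, mean_sq_diff = mean_sq_diff)
```

Smaller values indicate eCDFs that are closer together. The maximum is driven by the single largest gap, while the mean squared difference reflects the whole range of the variable.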

Figure 5.3: Example Comparison of Means from IRS SOI

It is also useful to visually compare univariate distributions using histograms (Figure 5.4), density plots (Figure 5.5), or empirical cumulative distribution function plots (Figure 5.6). The following examples use the Palmer penguins data.

Code
compare_penguins |>
  select(
    data_source, 
    bill_length_mm, 
    flipper_length_mm
  ) |>
  pivot_longer(-data_source, names_to = "variable") |>
  ggplot(aes(x = value, fill = data_source)) +
  geom_histogram(alpha = 0.3, color = NA, position = "identity") +
  facet_wrap(~ variable, scales = "free") +
  scatter_grid()
Figure 5.4: Compare Synthetic and Confidential Distributions with Histograms
Code
compare_penguins |>
  select(
    data_source, 
    bill_length_mm, 
    flipper_length_mm
  ) |>
  pivot_longer(-data_source, names_to = "variable") |>
  ggplot(aes(x = value, fill = data_source)) +
  geom_density(alpha = 0.3, color = NA) +
  facet_wrap(~variable, scales = "free") +
  scatter_grid()
Figure 5.5: Compare Synthetic and Confidential Distributions with Density Plots
Code
compare_penguins |>
  select(
    data_source, 
    bill_length_mm, 
    flipper_length_mm
  ) |>
  pivot_longer(-data_source, names_to = "variable") |>
  ggplot(aes(x = value, color = data_source)) +
  stat_ecdf() +
  facet_wrap(~ variable, scales = "free") +
  scatter_grid()
Figure 5.6: Compare Synthetic and Confidential Distributions with Empirical CDF Plots
Exercise 1: Using Utility Metrics (Conceptual)

Consider the following two syntheses of x. Which synthesis do you think has higher utility?

set.seed(20230710)
bind_rows(
  synth1 = tibble(
    x_conf = rnorm(n = 1000),
    x_synth = rnorm(n = 1000, mean = 0.2)
  ),
  synth2 = tibble(
    x_conf = rnorm(n = 1000),
    x_synth = rnorm(n = 1000, sd = 0.5)
  ),
  .id = "synthesis"
) |>
  pivot_longer(-synthesis, names_to = "variable") |>
  ggplot(aes(x = value, color = variable)) +
  stat_ecdf() +
  facet_wrap(~ synthesis) +
  scatter_grid()

Both syntheses have utility issues; what do you think are the issues?

  • We consider synth1 to be slightly higher utility than synth2 based on the large vertical distances between the lines for synth2.
  • synth1 matches the variance of the confidential data but the mean is a little larger. synth2 matches the mean but has lower variance with fewer observations in the tails of the synthetic data.
WARNING: Marginal vs. joint distribution

A synthetic dataset can do a great job of recreating every univariate or marginal distribution while failing to capture the joint distribution. The rest of these notes prioritize evaluating the relationships between variables.

5.3.2 Bivariate

Many analyses rely on relationships between variables. Reviewing pairwise relationships is challenging because it quickly becomes a high-dimensional problem. If a dataset has \(p\) variables, then there are \(\frac{p(p - 1)}{2}\) pairwise relationships in the data. Here, visualization and numeric summaries are important.

Correlation Fit

Definition: Correlation Fit

Correlation fit measures how well the synthetic dataset recreates the linear relationships between variables in the confidential dataset.

Correlation fit uses the lower triangle of correlation matrices for the synthetic data and confidential data and their difference.

Figure 5.7: Example calculation of correlation fit between synthetic and confidential data.
Three heatmaps illustrating correlation comparison. The first shows pairwise correlations among variables A–D for synthetic data, the second for confidential data, and the third shows the difference between them. For example, the correlation between A and B is 0.75 in synthetic data versus 0.90 in confidential data, yielding a difference of –0.15. Differences are color-coded, highlighting where correlations are well-preserved and where discrepancies exist.

Those differences are often summarized across all variable pairs using an L1 or L2 distance. Figure 5.7 shows the creation of a difference matrix. Let’s summarize the difference matrix using mean absolute error (MAE), which gives a sense of how far off, on average, a correlation in the synthetic data is from the corresponding correlation in the confidential data.

\[MAE_{dist} = \frac{1}{n}\sum_{i = 1}^n |dist_i|\]

where \(dist_i\) is the \(i\)-th of the \(n\) entries in the lower triangle of the difference matrix.

\[MAE_{dist} = \frac{1}{6} \left(|-0.15| + |0.01| + |0.1| + |-0.15| + |0.15| + |0.02|\right) \approx 0.097\]

Note: one can alternatively substitute different correlation measures (for example, rank correlation) or different correlation matrix distances (for example, Frobenius norm).
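The note above mentions rank correlation and the Frobenius norm as alternatives. A minimal sketch of both, using base R only, follows. The data frames `conf` and `synth` are simulated and purely hypothetical; they are not the data behind Figure 5.7.

```r
set.seed(20260423)
n <- 200

# Hypothetical confidential data with a strong a-b relationship
conf <- data.frame(a = rnorm(n))
conf$b <- conf$a + rnorm(n, sd = 0.5)
conf$c <- rnorm(n)

# Hypothetical synthetic data with a weaker a-b relationship
synth <- data.frame(a = rnorm(n))
synth$b <- synth$a + rnorm(n, sd = 1)
synth$c <- rnorm(n)

# Difference matrices using Pearson and Spearman (rank) correlation
diff_pearson <- cor(synth) - cor(conf)
diff_spearman <- cor(synth, method = "spearman") - cor(conf, method = "spearman")

# MAE over the lower triangle, as in the formula above
lower <- lower.tri(diff_pearson)
mae_pearson <- mean(abs(diff_pearson[lower]))
mae_spearman <- mean(abs(diff_spearman[lower]))

# Frobenius norm of the full difference matrix
# (each off-diagonal pair is counted twice; the diagonal is zero)
frobenius <- norm(diff_pearson, type = "F")

c(mae_pearson = mae_pearson, mae_spearman = mae_spearman, frobenius = frobenius)
```

Rank correlation is less sensitive to outliers and to monotone nonlinear relationships, so the two MAE values can disagree when the synthesis distorts tails but preserves ordering.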

Relative Mutual Information Fit

Pearson’s correlation coefficient is ever present but is limited to numeric variables. There is less consensus about measures for quantifying relationships between categorical variables. Here, we will use relative mutual information.

Definition: Relative Mutual Information

Relative mutual information (RMI) measures the reduction in entropy (i.e., uncertainty) in one variable when observing another variable. It quantifies the relationship between two variables, generalizes to categorical variables, and is in the interval [0,1].

Definition: Relative Mutual Information Fit

Relative mutual information fit measures how well the synthetic dataset recreates the relative mutual information between variables in the confidential dataset. The element-wise differences between the two RMI matrices are evaluated, and numeric summaries of the differences are calculated. Differences close to 0 indicate higher utility.

Let’s walk through a simple example using the Palmer penguins data.

The confidential RMI matrix shows the relationships between the three categorical variables in the confidential data.

        species island  sex
species    1.00   0.50 0.01
island     0.52   1.00 0.01
sex        0.01   0.01 1.00

The synthetic RMI matrix shows the relationships between the three categorical variables in the synthetic data.

        species island sex
species    1.00   0.45   0
island     0.49   1.00   0
sex        0.00   0.00   1

The difference matrix shows that the confidential RMI matrix and synthetic RMI matrix are fairly similar.

        species island   sex
species    0.00  -0.05 -0.01
island    -0.03   0.00 -0.01
sex       -0.01  -0.01  0.00

The mean absolute error (a scaled L1 norm) across the off-diagonal entries of the difference matrix is just 0.02.
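For readers who want to see how an RMI value might be computed, here is a minimal sketch. It assumes RMI is mutual information divided by the entropy of the first variable, which would produce an asymmetric matrix like the ones shown above; the notes do not state which normalization was used, so treat this as one plausible choice. The function names are hypothetical.

```r
# Shannon entropy of a vector of probabilities (zero cells are skipped)
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log(p))
}

# Assumed definition: I(X; Y) / H(X), for two categorical vectors
# of equal length with no missing values
relative_mutual_information <- function(x, y) {
  joint <- prop.table(table(x, y))
  h_x <- entropy(rowSums(joint))
  h_y <- entropy(colSums(joint))
  h_xy <- entropy(as.vector(joint))
  mutual_information <- h_x + h_y - h_xy
  mutual_information / h_x
}

# y fully determines x: all uncertainty in x is removed
rmi_identical <- relative_mutual_information(
  c("a", "a", "b", "b"),
  c("a", "a", "b", "b")
)

# x and y are independent: no uncertainty in x is removed
rmi_independent <- relative_mutual_information(
  c("a", "a", "b", "b"),
  c("u", "v", "u", "v")
)

c(identical = rmi_identical, independent = rmi_independent)
```

Under this normalization the value is undefined when the first variable is constant, because its entropy is zero.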

Class Activity 2: Correlation Difference (Computational)

Consider the following correlation matrices:

[1] "Synthetic"
     [,1] [,2] [,3]
[1,] 1.00  0.5 0.75
[2,] 0.50  1.0 0.80
[3,] 0.75  0.8 1.00
[1] "Confidential"
     [,1] [,2] [,3]
[1,] 1.00 0.35  0.1
[2,] 0.35 1.00  0.9
[3,] 0.10 0.90  1.0
  • Construct the difference matrix
  • Calculate MAE
  • Optional: Calculate RMSE
  • Optional: What is the main difference between MAE and RMSE?

If you do not have access to a computer, describe how you would carry out these calculations. For example, indicate which equation you would use and what values you would input.

  • Construct the difference matrix
# Enter the two correlation matrices from the prompt
mat_synth <- matrix(
  c(1.00, 0.50, 0.75,
    0.50, 1.00, 0.80,
    0.75, 0.80, 1.00),
  nrow = 3, byrow = TRUE
)

mat_conf <- matrix(
  c(1.00, 0.35, 0.10,
    0.35, 1.00, 0.90,
    0.10, 0.90, 1.00),
  nrow = 3, byrow = TRUE
)

diff <- mat_synth - mat_conf

diff[!lower.tri(diff)] <- NA

diff
     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,] 0.15   NA   NA
[3,] 0.65 -0.1   NA
  • Calculate MAE
mean(abs(diff[lower.tri(diff)]))
[1] 0.3
  • Optional: Calculate RMSE
sqrt(mean(diff[lower.tri(diff)] ^ 2))
[1] 0.389444
  • Optional: What is the main difference between MAE and RMSE?

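RMSE gives extra weight to large errors because it squares the errors instead of taking absolute values. We like to think of this as analogous to the difference between the mean and the median: the mean minimizes squared error and is sensitive to extreme values, while the median minimizes absolute error and is more robust.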

Class Activity 3: Correlation Difference (Computational)

Part 1: Calculate the correlation fit between the synthetic and confidential data. Fill in the blanks and run the code below.

penguins_conf <- read_csv(here::here("data", "penguins_synthetic_and_confidential.csv")) |>
  filter(data_source == "confidential")

penguins_synth <- read_csv(here::here("data", "penguins_synthetic_and_confidential.csv")) |>
  filter(data_source == "synthetic")

# The cor() function can take in a dataframe and compute correlations 
# between all columns in the dataframe and spit out a correlation matrix
conf_data_corr <- cor(###)
synth_data_corr <- cor(###)

conf_data_corr <- conf_data_corr[lower.tri(conf_data_corr)]
synth_data_corr <- synth_data_corr[lower.tri(synth_data_corr)]
  
correlation_diff <- conf_data_corr - synth_data_corr

# Correlation fit here is the sum of the absolute differences between the correlations
# (the square root of a squared difference is its absolute value).
cor_fit <- sum(sqrt( ### ^2))

cor_fit
penguins_conf <- read_csv(here::here("data", "penguins_synthetic_and_confidential.csv")) |>
  filter(data_source == "confidential")

penguins_synth <- read_csv(here::here("data", "penguins_synthetic_and_confidential.csv")) |>
  filter(data_source == "synthetic")

# The cor() function can take in a dataframe and compute correlations 
# between all columns in the dataframe and spit out a correlation matrix
conf_data_corr <- cor(select(penguins_conf, where(is.numeric)))
synth_data_corr <- cor(select(penguins_synth, where(is.numeric)))

conf_data_corr <- conf_data_corr[lower.tri(conf_data_corr)]
synth_data_corr <- synth_data_corr[lower.tri(synth_data_corr)]
  
correlation_diff <- conf_data_corr - synth_data_corr

# Correlation fit here is the sum of the absolute differences between the correlations
# (the square root of a squared difference is its absolute value).
cor_fit <- sum(sqrt(correlation_diff ^2))

cor_fit
[1] 0.6178178

Part 2: Compare the univariate distributions for mass and height in the confidential and synthetic data using density plots. Fill in the blanks and run the code below.

conf_data <- read_csv(here::here("data/lesson_03_conf_data.csv"))
synth_data <- read_csv(here::here("data/lesson_03_synth_data.csv"))

combined_data <- bind_rows(
  "synthetic" = synth_data, 
  "confidential" = conf_data,
  .id = "type"
)

# Create a density plot of the mass distributions
combined_data %>% 
  ggplot(aes(x = ###, fill = type)) +
  geom_density(alpha = 0.4)

# Create a density plot of the height distributions
combined_data %>% 
  ggplot(aes(x = ###, fill = type)) +
  geom_density(alpha = 0.4)
conf_data <- read_csv(here::here("data/lesson_03_conf_data.csv"))
synth_data <- read_csv(here::here("data/lesson_03_synth_data.csv"))

combined_data <- bind_rows(
  "synthetic" = synth_data, 
  "confidential" = conf_data,
  .id = "type"
)

# Create a density plot of the mass distributions
combined_data %>% 
  ggplot(aes(x = mass, fill = type)) +
  geom_density(alpha = 0.4)

# Create a density plot of the height distributions
combined_data %>% 
  ggplot(aes(x = height, fill = type)) +
  geom_density(alpha = 0.4)

5.3.3 Multivariate

Discriminant Metric Intuition

Definition: Discriminant Based Methods

Discriminant based methods measure how well a predictive model can distinguish (i.e., discriminate) between records from the confidential and synthetic data. Simply put, the harder it is for the predictive model to distinguish records from one another, the higher the general utility of the synthetic data.

  • For sufficiently high utility synthetic data, GSDS and synthetic data should be drawn from similar superpopulations¹.
  • The basic idea is to combine (stack) the GSDS and synthetic data and see how well a predictive model distinguishes (i.e., discriminates) between synthetic observations and confidential observations.

  • Poor model performance in distinguishing records indicates high-utility synthesis.

    • It is possible to use logistic regression for the predictive modeling, but more flexible tree-based models like decision trees, random forests, and boosted trees are more common.
  • Discriminant modeling involves:

    1. Training a flexible discriminator model on combined data.
    2. Evaluating model failure on out-of-sample data to assess synthesis quality.
  • General strategies:

    • Use flexible models that generalize well.
    • Train using holdout data excluded from synthesis.
    • Evaluate using metrics that reflect poor model fit including pMSE ratio, SPECKS, and AUC.

Discriminant based methods assess the quality of synthetic data by testing how well a predictive model can distinguish between synthetic and confidential records. This approach assumes that, for a high-utility synthesis, both datasets are drawn from the same superpopulation, meaning they should reflect similar underlying distributions.

The basic idea is to combine (stack) the confidential data and synthetic data and see how well a predictive model distinguishes (i.e., discriminates) between synthetic observations and confidential observations. If the model struggles to distinguish between the two, this suggests the synthetic data closely mimics the confidential data, indicating a high quality data synthesis.

Modeling techniques: While logistic regression can be used for this binary classification task, more commonly used methods include decision trees, random forests, and boosted trees. These models are often preferred due to their flexibility and ability to capture complex patterns.

Discriminant model metrics have two stages:

  1. Model Training: Fit a black-box model (known as a discriminator) on a subset of the combined confidential and synthetic data to predict whether each record originated from the confidential or synthetic data (i.e., binary classification).

  2. Model Evaluation: Assess how well (or how poorly) this model performs on out-of-sample data. If the model performs poorly (i.e., struggles to distinguish between the records), this indicates that the synthetic data are of high quality and closely resemble the confidential data.

General strategies:

  1. Discriminator models should be trained with generalization in mind and should generally be as flexible as possible.
  2. To accommodate discriminator model generalizability assessments, we recommend using holdout data (i.e., data withheld from the synthesis process).
  3. Use model evaluation metrics that assess lack of model fit.

Most discriminant based methods are propensity score based, allowing the method to compare the similarity of two datasets of the same structure of any dimension without making assumptions on the distributions of the attributes. Mathematically, these methods use the following steps. Let \(\mathbf{Y}\) be the confidential dataset with \(n\) observations and \(p\) variables.

  1. Combine the confidential and synthetic datasets, each of size \(n\). Create an indicator variable \(T\) where \(T_i=1\) if record \(i\) is from the synthetic data and \(T_i=0\) otherwise for \(i=1,\ldots, 2n\).
  2. Calculate the propensity score for each record \(i\), \(e_i=\Pr(T_i=1 \mid Y_i)\), through a classification algorithm, with the data attributes as input features.

What is done with the propensity scores next depends on the discriminant based method. Woo et al. (2009a) compute the mean squared error (MSE) of the propensity scores against the true proportion of synthetic cases. Snoke et al. (2018b) enhance Woo et al. (2009a)’s approach, in which the average squared difference between the propensity scores and the expected probabilities is called the propensity score mean squared error (pMSE). Essentially, the enhancement normalizes the pMSE by its expected null value and standard deviation, which aids interpretability and helps differentiate the synthetic dataset from the confidential dataset.

Snoke et al. (2018b) also develops the pMSE ratio, which is one of the most popular discriminant based methods. The pMSE ratio is the average pMSE score across all records, divided by the null model, where the null model is the expected value of the pMSE score under the best case scenario when the model used to generate the data reflects the confidential data perfectly. Sakshaug and Raghunathan (2010) discretizes the propensity scores based on how the Chi-squared test is formulated.

Finally, Bowen, Liu, and Su (2021) calculates the eCDFs of the propensity scores of the synthetic and confidential data and then computes the KS (Kolmogorov-Smirnov) distance, a method called SPECKS. In other words, the SPECKS method considers the worst-case separation between the synthetic dataset and the confidential dataset.

What the discriminant based metrics actually measure for assessing the synthetic data quality varies depending on the method and the classification algorithm. For instance, Bowen and Snoke (2021) compare several utility metrics, such as the pMSE ratio and SPECKS, to evaluate differentially private synthetic datasets² for a data challenge. The authors find that the utility metrics produce mixed results when ranking the best performing differentially private synthetic data methods. A study analyzing which features of the synthetic data are captured by various discriminant based methods under different classification models would be invaluable to the field (Drechsler 2022). However, to the best of our knowledge, no such study exists for synthetic data with or without differential privacy or another formal privacy guarantee.

Multivariate and discriminant based methods are high dimensional. To simplify learning, let’s focus on a two-dimensional case in Figure 5.8. In the first panel, the confidential data and synthetic data have the same population parameters. In the second panel, the means differ significantly.

Figure 5.8: A comparison of discriminant metrics on a good synthesis (top) and a poor synthesis (bottom).

Scatter plot comparing confidential (red) and synthetic (blue) data points across variables x and y. The two datasets largely overlap, suggesting they are drawn from the same distribution. Evaluation metrics shown are pMSE ratio = 1.15, SPECKS = 0.0146, and AUC = 0.6, supporting similarity between the confidential and synthetic data.

Scatter plot comparing confidential (red) and synthetic (blue) data points across variables x and y. The two datasets largely do not overlap, suggesting they are not drawn from the same distribution. Evaluation metrics shown are pMSE ratio = 27.78, SPECKS = 0.9997, and AUC = 0.98, supporting that the confidential and synthetic data are different.

Discriminant Metrics Calculation

It is easy to visually compare a small number of dimensions, like in Figure 5.8, but this quickly becomes impossible with more dimensions. pMSE ratio, SPECKS, and AUC help scale up this basic idea.

pMSE ratio, SPECKS, and AUC all require calculating propensity scores (i.e., the probability that a particular data point belongs to the synthetic data) and start with the same steps. The following two steps summarize the process of adding propensity scores:

  1. Row bind the synthetic and confidential data. Add an indicator variable with 0 for the confidential data and 1 for the synthetic data.

      species    bill_length_mm  sex   ind
      Chinstrap  49.5            male  0
      ...        ...             ...   ...
      Adelie     46.0            male  1

  2. Calculate propensity scores (i.e., probabilities for group membership) for whether a given row belongs to the synthetic dataset.

      species    bill_length_mm  sex   ind  prop_score
      Chinstrap  49.5            male  0    0.32
      ...        ...             ...   ...  ...
      Adelie     46.0            male  1    0.64
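The two steps above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions: the data frames `conf` and `synth` are simulated placeholders rather than the penguins data, and logistic regression is used only because it needs no extra packages, even though more flexible discriminators are usually preferred.

```r
set.seed(20260423)

# Hypothetical toy data standing in for the confidential and synthetic files
conf <- data.frame(x = rnorm(500), y = rnorm(500))
synth <- data.frame(x = rnorm(500, mean = 0.3), y = rnorm(500))

# Step 1: row bind and add the indicator (0 = confidential, 1 = synthetic)
combined <- rbind(
  cbind(conf, ind = 0),
  cbind(synth, ind = 1)
)

# Step 2: fit a simple discriminator and predict Pr(ind = 1 | x, y)
discriminator <- glm(ind ~ x + y, data = combined, family = binomial())
combined$prop_score <- predict(discriminator, type = "response")

head(combined)
```

Because the synthetic `x` is shifted upward in this toy setup, synthetic rows tend to receive higher propensity scores than confidential rows. These are in-sample scores; out-of-sample scores require a holdout set.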

Once the combined data have propensity scores, pMSE, SPECKS, and ROC AUC are three ways to summarize the propensity scores to evaluate how easy it is to discriminate between the confidential data and synthetic data.

pMSE: Calculates the mean squared error (MSE) between the propensity scores and the expected probabilities:

  • Proposed by Woo et al. (2009b) and enhanced by Snoke et al. (2018a)

  • After doing steps 1) and 2) above:

    1. Calculate the expected probability, i.e., the share of synthetic data in the combined data. When the synthetic and confidential datasets are of equal size, this will always be 0.5.

      species    bill_length_mm  sex   ind  prop_score  exp_prob
      Chinstrap  49.5            male  0    0.32        0.5
      ...        ...             ...   ...  ...         ...
      Adelie     46.0            male  1    0.64        0.5

    2. Calculate pMSE, which is the mean squared difference between the propensity scores and expected probabilities. Let \(N\) be the number of observations in the combined data.

    \[pMSE = \frac{(0.32 - 0.5)^2 + \cdots + (0.64-0.5)^2}{N} \]
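As a small worked example of the formula, suppose the combined data had only four rows. The propensity scores 0.32 and 0.64 come from the table above; the two middle values are made up for illustration.

```r
# Hypothetical four-row example: two confidential rows, two synthetic rows
prop_score <- c(0.32, 0.45, 0.58, 0.64)
ind <- c(0, 0, 1, 1)

# Expected probability is the share of synthetic rows in the combined data
exp_prob <- mean(ind)

# pMSE is the mean squared difference from the expected probability
pmse <- mean((prop_score - exp_prob)^2)
pmse
# (0.0324 + 0.0025 + 0.0064 + 0.0196) / 4 = 0.015225
```

A pMSE near 0 means the propensity scores sit close to the expected probability, so the discriminator learned little about which rows are synthetic.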

pMSE ratio: Often people use the pMSE ratio, which is the average pMSE score across all records, divided by the null model (Snoke et al. 2018c).

  • The null model is the expected value of the pMSE score under the best case scenario when the model used to generate the data reflects the confidential data perfectly.

  • pMSE ratio = 1 means that your synthetic data and confidential data are indistinguishable, although values this low are almost never achieved.
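One model-agnostic way to approximate the null pMSE is to shuffle the synthetic/confidential labels, refit the discriminator, and average the resulting pMSE values. The sketch below takes that permutation approach as an assumption; it is not necessarily the calculation used by Snoke et al., and the toy data, the logistic discriminator, and the helper `calc_pmse()` are all hypothetical.

```r
set.seed(20260423)

# Hypothetical toy data standing in for the confidential and synthetic files
conf <- data.frame(x = rnorm(500), y = rnorm(500))
synth <- data.frame(x = rnorm(500, mean = 0.3), y = rnorm(500))
combined <- rbind(cbind(conf, ind = 0), cbind(synth, ind = 1))

# Fit the discriminator and return the in-sample pMSE
calc_pmse <- function(data) {
  fit <- glm(ind ~ x + y, data = data, family = binomial())
  mean((fitted(fit) - mean(data$ind))^2)
}

observed_pmse <- calc_pmse(combined)

# Approximate the null: with shuffled labels, nothing distinguishes the groups
null_pmse <- replicate(100, {
  permuted <- combined
  permuted$ind <- sample(permuted$ind)
  calc_pmse(permuted)
})

pmse_ratio <- observed_pmse / mean(null_pmse)
pmse_ratio
```

In this toy setup the synthetic mean of `x` is deliberately shifted, so the ratio lands well above 1. With a real synthesis, the same permutation loop works for any discriminator, at the cost of refitting the model many times.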

SPECKS: Synthetic data generation; Propensity score matching; Empirical Comparison via the Kolmogorov-Smirnov distance.

After generating propensity scores (i.e., steps 1 and 2 from above):

  1. Calculate the empirical CDFs of the propensity scores for the synthetic and confidential data, separately.

  2. Calculate the Kolmogorov-Smirnov (KS) distance between the two empirical CDFs. The KS distance is the maximum vertical distance between two empirical CDFs.
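The two SPECKS steps can be sketched with base R. The propensity scores here are drawn from beta distributions purely for illustration (an assumption, not output from a real discriminator), and `ks.test()` is used only as a cross-check of the hand calculation.

```r
set.seed(20260423)

# Hypothetical propensity scores for each group
prop_conf <- rbeta(500, shape1 = 4, shape2 = 5)
prop_synth <- rbeta(500, shape1 = 5, shape2 = 4)

# Step 1: empirical CDFs of the propensity scores, by group
ecdf_conf <- ecdf(prop_conf)
ecdf_synth <- ecdf(prop_synth)

# Step 2: maximum vertical distance, evaluated at every observed score
grid <- sort(c(prop_conf, prop_synth))
specks <- max(abs(ecdf_conf(grid) - ecdf_synth(grid)))

# The two-sample KS statistic from ks.test() is the same quantity
specks_check <- unname(ks.test(prop_conf, prop_synth)$statistic)

c(specks = specks, specks_check = specks_check)
```

Values near 0 mean the two groups receive nearly the same distribution of propensity scores; values near 1 mean the discriminator separates them almost completely.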

Receiver Operating Characteristic (ROC) curves: Show the trade-off between false positives and true positives. Area under the curve (AUC) is a single-number summary of the ROC curve.

AUC is a common tool for evaluating classification models. In most settings a high AUC is desirable, but here high values are bad because they suggest the model can distinguish between confidential and synthetic observations.

After generating propensity scores (i.e., steps 1 and 2 from above), calculate the AUC of the discriminator and interpret it as follows:

  • In our context, high AUC = good at discriminating = poor synthesis.

  • In the best case, we want AUC = 0.5 because that means the discriminator is no better than a random guess.
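AUC can be computed directly from the propensity scores and the indicator without drawing the ROC curve, because it equals the probability that a randomly chosen synthetic row has a higher propensity score than a randomly chosen confidential row (ties count one half). The helper `calc_auc()` below is a hypothetical name for a minimal rank-based sketch of that identity.

```r
# Rank-based (Mann-Whitney) AUC; ind is 1 for synthetic, 0 for confidential
calc_auc <- function(prop_score, ind) {
  ranks <- rank(prop_score)  # ties receive average ranks
  n_synth <- sum(ind == 1)
  n_conf <- sum(ind == 0)
  (sum(ranks[ind == 1]) - n_synth * (n_synth + 1) / 2) / (n_synth * n_conf)
}

# Every synthetic row scores higher than every confidential row
auc_separated <- calc_auc(c(0.32, 0.45, 0.58, 0.64), c(0, 0, 1, 1))

# Scores carry no information about group membership
auc_uninformative <- calc_auc(c(0.2, 0.8, 0.2, 0.8), c(0, 0, 1, 1))

c(separated = auc_separated, uninformative = auc_uninformative)
```

The first call illustrates the worst case for synthesis quality (AUC = 1) and the second the best case (AUC = 0.5).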

pMSE ratio and SPECKS evaluate the distributional similarity without evaluating if the propensities are correct. ROC AUC evaluates the order of the propensities for correctness but does not evaluate if the propensities are well calibrated (e.g., are 90% propensities correct about 90% of the time). In practice, it is good to look at some combination of pMSE ratio, SPECKS, and ROC AUC.

It is useful to look at variable importance for predictive models when observing poor discriminant based metrics. For example, if a discriminator has a high ROC AUC, looking at the variable importance can help diagnose which variables are synthesized with low utility.

WARNING: Models can overfit

Many predictive models for generating propensities can memorize chance features of the data used to fit the models. We suggest using a training/testing split, with v-fold cross-validation for hyperparameter tuning, and comparing in-sample and out-of-sample propensities and model accuracy.
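A minimal sketch of the training/testing comparison follows, using base R only. The toy data, the 75/25 split, and the logistic discriminator are all assumptions made for illustration; hyperparameter tuning with cross-validation is omitted because logistic regression has nothing to tune.

```r
set.seed(20260423)

# Hypothetical toy data standing in for the confidential and synthetic files
conf <- data.frame(x = rnorm(500), y = rnorm(500))
synth <- data.frame(x = rnorm(500, mean = 0.3), y = rnorm(500))
combined <- rbind(cbind(conf, ind = 0), cbind(synth, ind = 1))

# Hold out 25% of the combined rows for out-of-sample evaluation
train_rows <- sample(nrow(combined), size = 0.75 * nrow(combined))
train <- combined[train_rows, ]
test <- combined[-train_rows, ]

# Fit the discriminator on the training rows only
discriminator <- glm(ind ~ x + y, data = train, family = binomial())

# pMSE of predicted propensities against the training share of synthetic rows
calc_pmse <- function(data) {
  prop_score <- predict(discriminator, newdata = data, type = "response")
  mean((prop_score - mean(train$ind))^2)
}

pmse_train <- calc_pmse(train)
pmse_test <- calc_pmse(test)

c(pmse_train = pmse_train, pmse_test = pmse_test)
```

A large gap between the in-sample and out-of-sample values is a sign that the discriminator is memorizing the training rows. A simple model like this one is unlikely to show such a gap, but flexible tree-based models can.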

5.4 Specific Utility

Synthetic data only support analyses captured by the synthesis process. It is informative for data curators and data users to see how well the synthetic data recreate the results from analyses commonly run on the confidential data or similar datasets.

Definition: Specific Utility

Specific utility or analysis-specific utility metrics measure differences between synthetic and confidential data for pre-specified use cases.

  • General utility metrics help data synthesizers validate their methods (i.e., producing overall quality synthetic data); specific utility metrics help downstream users trust the synthetic data (i.e., producing valid synthetic data for certain analyses).

  • It is most useful for data curators to identify important analyses before synthesis by reviewing the literature and working with data users.

Example questions for specific utility:

  • “How well does synthetic data replicate the relationship between GPA and college enrollment found in the confidential data?”
  • “How well does synthetic data produce a confidence or credible interval for a specific model parameter?”

Example metrics include:

  • Differences between pointwise statistics (summary statistics, specific estimates).
  • Model parameter comparisons between confidential and synthetic data (regression coefficients, network node weights, etc.).
  • Confidence or credible interval differences for specific statistics (overlap, signs, significance).

Analysis-specific utility measures the similarity of results between confidential and synthetic datasets for a specific analysis or multiple analyses. These metrics are distinct from general utility metrics, which are primarily useful for data synthesizers to assess the overall quality of the synthetic data generation process. In contrast, specific utility metrics are more relevant to downstream data users who need assurance that the synthetic data can support their intended analyses. Simply put, these metrics assess whether data users would reach the same conclusions if they ran their analysis on the confidential dataset or on the synthetic dataset. The specific utility metrics will vary across applications, depending on the common uses of the data.

There are a few ways to compare the synthetic and confidential data outputs. For univariate estimands, such as regression coefficients and means, Karr et al. (2006) propose the confidence interval overlap. This metric, which is common in the synthetic data literature, compares the confidence intervals from the confidential and synthetic datasets to see how much the synthetic data generation affects inference. Snoke et al. (2018c) propose a modification that allows for negative confidence interval overlap values, which show how far apart the intervals are when they do not overlap.

5.4.1 Sign, Significance, Overlap (SSO)

TipDefinition: Sign, Significance, Overlap (SSO)

Sign, Significance, Overlap (SSO) measures how frequently a statistic calculated on the confidential data and on the synthetic data has the same sign, the same statistical significance, and overlapping confidence intervals.

  • Comparisons mostly focus on direct comparisons of results from the confidential data and synthetic data:
    • Do estimates have the same sign?
    • Do they share the same statistical inference at a common \(\alpha\) level?
    • Do their confidence intervals overlap?
  • These three checks are combined into the SSO match (Sign, Significance, Overlap), which measures the proportion of times all three criteria are met.
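The three checks and their combination can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the toy estimates are hypothetical, and "significance" is judged by whether a confidence interval excludes the null value, which matches a two-sided test at the \(\alpha\) level implied by the interval's confidence level.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def is_significant(ci, null_value=0.0):
    """An estimate is significant if its interval excludes the null value."""
    lower, upper = ci
    return not (lower <= null_value <= upper)


def intervals_overlap(ci_a, ci_b):
    """True if the two intervals share at least one point."""
    return max(ci_a[0], ci_b[0]) <= min(ci_a[1], ci_b[1])


def sso_match(conf_est, conf_ci, synth_est, synth_ci, null_value=0.0):
    """True only when sign, significance, and overlap all agree."""
    return (
        sign(conf_est) == sign(synth_est)
        and is_significant(conf_ci, null_value) == is_significant(synth_ci, null_value)
        and intervals_overlap(conf_ci, synth_ci)
    )


# Hypothetical results for four coefficients:
# (confidential estimate, confidential CI, synthetic estimate, synthetic CI)
results = [
    (1.2, (0.4, 2.0), 0.9, (0.1, 1.7)),     # all three checks pass
    (1.2, (0.4, 2.0), 0.3, (-0.5, 1.1)),    # significance differs
    (-0.8, (-1.5, -0.1), 0.6, (0.1, 1.1)),  # sign differs, no overlap
    (0.2, (-0.3, 0.7), 0.1, (-0.5, 0.7)),   # both non-significant, all pass
]

matches = [sso_match(*row) for row in results]
sso_rate = sum(matches) / len(matches)
print(matches, sso_rate)
```

Reporting the proportion across many estimates, rather than a single pass or fail, is what makes SSO a summary of how often users would reach the same conclusion.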

5.4.2 Regression confidence interval overlap

TipDefinition: Confidence Interval Overlap (CIO)

Confidence interval overlap (CIO) quantifies how well confidence intervals from estimates on the synthetic data recreate confidence intervals from the confidential data.

1 indicates perfect overlap. 0 indicates intervals that are adjacent but not overlapping. Negative values indicate gaps between the intervals.

A common example is comparing intervals from linear regression models and logistic regression models.

We define the measure as:

\[\begin{equation}\label{eqn:io} \text{CIO} = 0.5 \bigg( \frac{\min(u_c, u_s) - \max(l_c, l_s)}{u_c - l_c} + \frac{\min(u_c, u_s) - \max(l_c, l_s)}{u_s - l_s} \bigg) \end{equation}\]

where \(u_c\), \(l_c\) and \(u_s\), \(l_s\) are the upper and lower bounds for the confidential and synthetic confidence intervals respectively.

The metric measures, for a single estimate, how much the confidence intervals estimated from the confidential and synthetic data overlap on average, where the maximum value is 1. The value is negative if the intervals do not overlap and grows more negative as they move further away from each other.
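The definition above translates directly into a few lines of code. The sketch below is illustrative only: the function name `ci_overlap` and the example intervals are assumptions chosen to show the benchmark values (1, 0, and negative) described in this section.

```python
def ci_overlap(conf_ci, synth_ci):
    """Confidence interval overlap (CIO) for one estimate.

    conf_ci:  (l_c, u_c) bounds from the confidential data.
    synth_ci: (l_s, u_s) bounds from the synthetic data.
    """
    l_c, u_c = conf_ci
    l_s, u_s = synth_ci
    if u_c <= l_c or u_s <= l_s:
        raise ValueError("each interval needs upper bound > lower bound")
    # Length of the shared region; negative when there is a gap.
    shared = min(u_c, u_s) - max(l_c, l_s)
    return 0.5 * (shared / (u_c - l_c) + shared / (u_s - l_s))


# Hypothetical intervals illustrating the benchmark values.
cio_identical = ci_overlap((0.0, 2.0), (0.0, 2.0))  # perfect overlap
cio_adjacent = ci_overlap((0.0, 1.0), (1.0, 2.0))   # touching, no overlap
cio_disjoint = ci_overlap((0.0, 1.0), (2.0, 3.0))   # gap between intervals
cio_nested = ci_overlap((0.0, 4.0), (1.0, 2.0))     # one interval inside the other
print(cio_identical, cio_adjacent, cio_disjoint, cio_nested)
```

The nested case shows why the measure cannot fall below 0.5 when one interval contains the other: the narrower interval always contributes a ratio of 1.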

Example of confidence interval overlap with “great overlap”, “good overlap”, and “poor overlap”.

WarningWARNING: Limitation of the confidence interval

The interpretability of confidence interval overlap as a utility metric diminishes when disclosure control methods result in excessively wide intervals.

A drawback of the confidence interval overlap measure is that it cannot distinguish whether the confidential or the synthetic dataset has the wider confidence interval when one interval covers the other. If one interval is wider and completely encompasses the other interval, the CIO is at least 0.5 no matter how wide the larger interval is. This is why Barrientos et al. (2024) created a new metric called sign, significance, and overlap (SSO) match. SSO is the proportion of times that intervals overlap and have the same sign and significance. Although created for evaluating differentially private regression outputs, SSO can be applied to synthetic data outputs as well.

Class Activity 4: Sign, Significance, and Overlap Match (Conceptual)

Suppose we are interested in the following null and alternative hypotheses:

\[H_0: \mu = 0\]

\[H_a: \mu \ne 0\]

Consider the following output:

[1] "Confidential Mean: 2.7926609008331"
[1] "Confidential Confidence Interval"
[1] 2.308338 3.276984
attr(,"conf.level")
[1] 0.95
[1] "Synthetic Mean: 2.08452909904545"
[1] "Synthetic Confidence Interval"
[1] 1.512416 2.656643
attr(,"conf.level")
[1] 0.95

Do the synthetic data achieve SSO match?

Yes! The confidence intervals overlap, the signs are the same, and the statistical significance is the same.

5.4.3 Example: Microsimulation results

Figure 5.9: Simplified tax calculator results for the confidential and synthetic data.

Class Activity 5: General vs. Specific (Conceptual)

Are the following metrics examples of general utility or specific utility metrics?

1. Differences in parameter confidence intervals for a prespecified regression model.

2. Distance between empirical cumulative distribution functions for a numeric variable.

3. Difference in the estimate of a treatment effect for a quasi-experiment.

1. Differences in parameter confidence intervals for a prespecified regression model: Specific.

2. Distance between empirical cumulative distribution functions for a numeric variable: General.

3. Difference in the estimate of a treatment effect for a quasi-experiment: Specific.

5.5 Case Study: Decennial Census Disclosure Avoidance System

The Disclosure Avoidance System (DAS) for the 2020 Decennial Census applied differential privacy to statistics before publication to reduce disclosure risks. To demonstrate the new DAS and the TopDown Algorithm it uses, the Census Bureau released a series of demonstration products based on the 2010 Decennial Census and calculated utility metrics.

In their feedback on the 2010 Demonstration Data Products, data users identified at least four areas of concern:

  1. Accuracy
  2. Bias
  3. Outliers
  4. Impossible or improbable results

5.5.1 Accuracy

  1. Mean Absolute Error (MAE)
  2. Mean Numeric Error (ME)
  3. Root Mean Squared Error (RMSE)
  4. Mean Absolute Percent Error (MAPE)
  5. Coefficient of Variation
  6. Total Absolute Error of Shares (TAES): This measure finds the proportion of each Microdata Detail File (MDF) value to the total MDF value for the summary geography and subtracts the proportion of the Census Edited File (CEF) value to the total CEF value for the summary geography. Here the MDF is the privacy-protected output of the DAS and the CEF is the confidential input. The absolute values of these proportional differences are then summed across evaluation geographies to the summary geography level. The goal is to provide a measure of the distributional error in the MDF shares.
  7. Percent Difference Thresholds: Count of absolute percent differences above a certain threshold.
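Several of the accuracy measures listed above can be sketched in a few lines. The sketch below is a hedged illustration, not the Census Bureau's implementation: the function name, the toy counts, and the 5 percent threshold are assumptions, percent-based measures skip geographies whose CEF count is zero, and the coefficient of variation is omitted because its exact definition is not given here.

```python
import math


def accuracy_metrics(mdf, cef, pct_threshold=5.0):
    """Accuracy summaries comparing protected counts (mdf) to confidential counts (cef)."""
    errors = [m - c for m, c in zip(mdf, cef)]
    n = len(errors)
    # Absolute percent errors are undefined where the CEF count is zero.
    abs_pct = [100.0 * abs(m - c) / c for m, c in zip(mdf, cef) if c != 0]
    mdf_total, cef_total = sum(mdf), sum(cef)
    return {
        "MAE": sum(abs(e) for e in errors) / n,
        "ME": sum(errors) / n,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MAPE": sum(abs_pct) / len(abs_pct),
        # Sum of absolute differences between MDF shares and CEF shares.
        "TAES": sum(abs(m / mdf_total - c / cef_total) for m, c in zip(mdf, cef)),
        # Number of geographies whose absolute percent error exceeds the threshold.
        "n_above_threshold": sum(p > pct_threshold for p in abs_pct),
    }


# Hypothetical counts for four evaluation geographies within one summary geography.
cef = [100, 200, 50, 650]
mdf = [110, 190, 55, 645]

metrics = accuracy_metrics(mdf, cef)
for name, value in metrics.items():
    print(name, round(value, 4))
```

In this toy example the errors cancel, so the mean numeric error is zero even though the mean absolute error is not, which is one reason to report several measures together.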

5.5.2 Bias

  1. Mean Numeric Error (ME)
  2. Mean Algebraic Percent Error (MALPE)
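The two bias measures keep the sign of each error, so positive and negative errors can cancel. The sketch below illustrates this with hypothetical counts; the function names are assumptions, and geographies with a zero confidential count are skipped in the percent measure.

```python
def mean_numeric_error(protected, confidential):
    """ME: average signed difference between protected and confidential counts."""
    diffs = [p - c for p, c in zip(protected, confidential)]
    return sum(diffs) / len(diffs)


def mean_algebraic_percent_error(protected, confidential):
    """MALPE: average signed percent difference (zero denominators skipped)."""
    pct = [100.0 * (p - c) / c for p, c in zip(protected, confidential) if c != 0]
    return sum(pct) / len(pct)


# Hypothetical counts: small geographies gain, large geographies lose.
confidential = [100, 200, 50, 650]
protected = [110, 190, 55, 645]

me = mean_numeric_error(protected, confidential)
malpe = mean_algebraic_percent_error(protected, confidential)
print(round(me, 4), round(malpe, 4))
```

Here the numeric errors cancel exactly while the percent errors do not, because the same absolute error is a larger share of a small count. This is one concrete way absolute and relative views of bias can disagree.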

5.5.3 Outliers and Impossible or Improbable Results

Note: The TopDown Algorithm included post-processing that ensured that the DAS returned non-negative integers for all counts.

Additionally, certain statistics will be internally examined for “outliers”: What is the largest increase in tabulated value? What is the largest decrease? Is there an inconsistency across the person and unit tables that is impossible or highly improbable? These will inform internal evaluations about the plausibility of tabulated results. Counts of outliers will be made available externally to allow for an assessment of the number of entities with exceptionally large differences between the MDF and the CEF for several of the data metrics tables.

5.5.4 Specific Utility

The Census Bureau also identified specific use cases through a Federal Register Notice, the Committee on National Statistics (CNSTAT) Demonstration Products Workshop, and other outreach.

  • Zero-sum total: “Uses that rely on the accuracy of the distribution in addition to the overall accuracy because a fixed amount of something is being distributed across categories.”
  • Zero-sum category: “Same as zero-sum total except use cases rely on estimates for some subset of the total.”
  • Variable-sum total: “Similar to zero-sum use cases except that the total of what is being distributed can vary.”
  • Variable-sum category: “Same as variable-sum total but for a subset of the population.”
  • Single Year of Age Accuracy: “These use cases require accuracy for single years of age rather than age groups.”
  • Rates Accuracy: “These uses cases rely on a measure of the size of a subgroup(s) within the total population.”
  • Percent Threshold: “Use case depends on the subset of the population crossing a percent threshold.”
  • Numeric Threshold: “Use case depends on the subset of the population crossing a numeric threshold.”

5.5.5 A few themes

  • The only outputs of the DAS are counts of different person and housing characteristics at different levels of geography. The scope of the utility evaluation is still huge.
  • It can be difficult to balance absolute changes and relative changes in utility evaluations.
  • Small denominators create challenges for many relative error metrics.

5.6 Further Considerations and Dimensions of Utility Metrics

5.6.1 Scope and Model Complexity

Comparisons of the confidential data and synthetic data may not capture the full trade-off between privacy and utility. For example, the above utility metrics do not capture the effect of deciding to include or omit a variable from a synthetic dataset. Similarly, they do not capture the effect of coarsening a categorical variable.

  • Scope: what synthetic data components are used?
    • Smaller scope: fewer variables, simpler variable representations.
    • Larger scope: more variables, more complex representations.
  • Model complexity: what statistical properties is the metric trying to capture?
    • Lower complexity: partial descriptions of distribution features.
    • Higher complexity: more complete descriptions of distributions.

5.6.2 Prior knowledge

  • Prior knowledge: what might users know about how data was generated?
    • Weaker knowledge: limited knowledge of synthetic and/or confidential data generating processes.
    • Stronger knowledge: well-informed or exact knowledge of synthetic and/or confidential data generating processes.

TipDefinition: Black-Box vs. White-Box

Black-box synthetic data methods are methods where modeling and sampling parameters are NOT shared with users.

White-box synthetic data methods are methods where modeling and sampling parameters ARE shared with users.

  • Methodological transparency (e.g., sharing modeling choices and code) can be useful, even without sharing parameters.
  • Creating arbitrarily queryable generative models can be equivalent to releasing parameters themselves.

5.6.3 Uncertainty sources and quantification

Until now, we’ve only considered comparing the confidential data with one synthetic dataset created with one sample from the synthesizer. Was this a lucky (or unlucky) sample? How does the uncertainty of the synthesis process align with the specific dataset we observe?

  • Uncertainty sources: what sources of randomness are being quantified?
    • More simple: uncertainty due to model resampling alone.
    • Less simple: uncertainty due to modeling, model resampling, noise, and/or data generating processes.
  • Uncertainty quantification: how are source(s) of randomness being quantified?
    • More simple: using sample-specific or instance-specific empirical distances.
    • Less simple: using distributional descriptions of concentration around population parameters.

TipDefinition: Multiple Synthesis

Multiple synthesis is the process of generating multiple synthetic datasets from the same confidential dataset.

Each synthetic dataset is often called a replicate.

For Data Users

Early papers for data synthesis promoted multiple synthesis. This makes sense because the authors had backgrounds in multiple imputation for missing data analysis. Under this approach, users would:

  1. receive multiple synthetic datasets.
  2. run their analysis on each dataset calculating point estimates and variances.
  3. use variations on Rubin’s rules to combine the within-dataset and between-dataset variances into standard errors for the estimates that attempt to include the uncertainty from the synthesis process.

This provides transparency for end users but limits synthesis modeling choices, puts a burden on users to work with multiple datasets and to understand complex combination rules, and increases disclosure risks (all else equal) by releasing more synthetic data.
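To make the combination step concrete, the sketch below applies one such variation: the combining rule commonly used for partially synthetic data, where the total variance is the average within-dataset variance plus the between-dataset variance divided by the number of datasets. This is an assumption about which variation applies; fully synthetic data use a different rule. The function name and the toy inputs are hypothetical.

```python
import math
import statistics


def combine_partially_synthetic(estimates, variances):
    """Combine per-dataset point estimates and variances.

    estimates: point estimate from each of the m synthetic datasets.
    variances: estimated variance of that estimate within each dataset.
    Returns the combined point estimate and its standard error.
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need matching inputs from at least two datasets")
    q_bar = statistics.mean(estimates)   # combined point estimate
    u_bar = statistics.mean(variances)   # average within-dataset variance
    b = statistics.variance(estimates)   # between-dataset variance (divisor m - 1)
    total_variance = u_bar + b / m       # rule assumed for partially synthetic data
    return q_bar, math.sqrt(total_variance)


# Hypothetical results from five synthetic datasets.
estimates = [2.0, 2.2, 1.8, 2.1, 1.9]
variances = [0.09, 0.09, 0.09, 0.09, 0.09]

point_estimate, standard_error = combine_partially_synthetic(estimates, variances)
print(round(point_estimate, 4), round(standard_error, 4))
```

Even this short rule illustrates the burden on users: they must know which variation matches the synthesis design before the standard errors are valid.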

For Data Curators

Data curators may still wish to explore the uncertainty of the synthesis process using multiple synthesis.

WarningWARNING: Utility metric computation time

As utility metrics capture more of the data generating process, they become more computationally intensive to produce.

Strategies for incorporating uncertainty, from easiest to hardest:

  1. Working with one synthetic data replicate.
    • Pros: Computationally feasible.
    • Cons: Metrics vary across replicates.
  2. Working with multiple synthetic data replicates.
    • Pros: Allows for empirical synthesis uncertainty quantification.
    • Cons: Only approximates the uncertainty due to synthesis.
  3. Working with white-box or transparent synthesis processes.
    • Pros: Allows for theoretically correct synthesis uncertainty quantification.
    • Cons: Typically requires computational tractability for drawing new samples from the synthesis process.
  4. Working with an entire data generating process.
    • Pros: Captures end-to-end uncertainty quantification.
    • Cons: Usually computationally intractable and requires intensive sampling methods (e.g., Markov chain Monte Carlo).
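The second strategy, working with multiple replicates, can be sketched with a toy example. Everything here is an assumption made for illustration: the synthesizer simply draws from a normal distribution fitted to the confidential values, the utility metric is the absolute difference in means, and the data are simulated with a fixed seed.

```python
import random
import statistics


def toy_synthesizer(confidential, rng):
    """Draw one synthetic replicate from a normal fitted to the confidential data."""
    mu = statistics.mean(confidential)
    sigma = statistics.stdev(confidential)
    return [rng.gauss(mu, sigma) for _ in confidential]


def mean_difference(confidential, synthetic):
    """Toy utility metric: absolute difference between the two sample means."""
    return abs(statistics.mean(synthetic) - statistics.mean(confidential))


rng = random.Random(20260423)  # fixed seed so the sketch is reproducible
confidential = [rng.gauss(50, 10) for _ in range(200)]  # simulated stand-in data

# Compute the same metric on each of 25 replicates.
replicate_metrics = [
    mean_difference(confidential, toy_synthesizer(confidential, rng))
    for _ in range(25)
]

summary = {
    "mean": statistics.mean(replicate_metrics),
    "sd": statistics.stdev(replicate_metrics),
    "min": min(replicate_metrics),
    "max": max(replicate_metrics),
}
print(summary)
```

The spread of the metric across replicates gives an empirical sense of whether a single replicate was a lucky or unlucky draw, at the cost of running the synthesizer and the metric many times.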

5.6.4 Designing Utility Metrics for External Communication

When synthetic data users are given synthetic data and a specific data processing task, they will do one of the following:

  1. Use synthetic data as-is, i.e., without adjustments for the synthesis process.
  2. Use synthetic data with adjustments for the synthesis process.
  3. Decide not to use synthetic data for the task.

TipNote: User-informed metrics

Utility evaluation should, ideally, provide just enough information for synthetic data users to make informed decisions about which option is best for them.

WarningWARNING: Disclosure risks with utility metrics

Sharing utility metrics with users can sometimes increase disclosure risks; to alleviate this tension, we recommend…

  1. Demonstrating utility metrics on publicly available data products.
  2. Providing utility metrics that only depend on data-independent modeling choices, such as noise distribution definitions.
  3. Selectively evaluating and disseminating utility metrics for public use, while keeping most utility evaluations restricted to data curators.
  4. Providing query-based alternatives to synthetic data, such as verification and validation servers.

Class Activity 6: General vs. Specific (Conceptual)

Figure 5.10: Different modeling trade-offs with the confidential data, low privacy synthetic data, medium privacy synthetic data, and high privacy synthetic data.

Above are three different toy bivariate synthetic datasets and a confidential dataset. A hypothetical user asks about whether they can use this toy synthetic data for a linear regression. Based on these comparisons, what user recommendations would you make for LowPrivacy, MedPrivacy, and HighPrivacy?

  • LowPrivacy -> #1: use synthetic data as-is
  • MedPrivacy -> #2: use synthetic data with adjustments
  • HighPrivacy -> #3: do not use synthetic data

  1. A superpopulation is a theoretical infinite population that represents all observations that could have ever existed. It is useful for thinking about parameters and uncertainty when working with population data instead of sample data.

  2. Differentially private synthetic data are synthetic data that satisfy the definition of differential privacy, which quantifies disclosure risk in a formal way. To learn more about differential privacy and formal privacy, we refer interested readers to Williams and Bowen (2023).