Appendix B — tidysynthesis Security Principles
The following document outlines the security principles that the Urban Institute uses when developing and maintaining the `tidysynthesis` R package. `tidysynthesis` helps users specify synthetic data models and generates samples from those models in R. However, these modeling and sampling steps are only a small part of the broader data sharing picture. The following guide outlines assumptions and best practices that we recommend when using `tidysynthesis` as part of a data publishing system.
`tidysynthesis` package users tasked with synthesizing confidential data should have sufficient permissions to access and use confidential data.
While this principle may seem obvious, it is worth explicitly stating that `tidysynthesis` requires its user to have access to the underlying confidential data. The software is not designed for use in an arbitrary queryable setting; that is, users without access to the confidential data should not be able to execute arbitrary `tidysynthesis` workflows on confidential data.
This is a conscious choice on the part of the `tidysynthesis` developers. By enabling the development of flexible modeling workflows, such as those using `library(recipes)`, `library(parsnip)`, and other highly extensible `tidymodels` packages, `tidysynthesis` has countless endpoints for extracting confidential data.
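As a concrete illustration, any user-supplied function embedded in a modeling workflow executes against the raw confidential data. The following sketch is hypothetical: `exfiltrate()` is an invented name and `mtcars` stands in for a confidential data frame; nothing here is part of the `tidysynthesis` API.

```r
library(recipes)

# hypothetical user-supplied function; any side effect here runs against the
# raw confidential values, e.g., write.csv(x, "leaked.csv")
exfiltrate <- function(x) {
  x
}

# mtcars stands in for a confidential data frame
rec <- recipe(~ ., data = mtcars) |>
  step_mutate(mpg = exfiltrate(mpg))

# prep() executes the step, and with it exfiltrate(), on the data itself
prepped <- prep(rec, training = mtcars)
```

Because any `recipes` step can run code like this, restricting who may author and execute workflows is the only meaningful control.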
Any `start_data` supplied to `tidysynthesis` should be treated as publicly available information to anyone with access to synthetic data.
`tidysynthesis` uses sequential conditional modeling beginning from a selection of starting data; typically, these starting data represent a subset of variables (i.e., confidential data columns) and an (often randomized or synthesized) subset of rows (i.e., confidential data entries). None of the disclosure risk protections afforded by synthetic data apply to the starting data, since they are left as-is in the synthesis output. Moreover, there could be significant disclosure risks if one can use the starting data to infer confidential data membership or synthesized attributes. Therefore, starting data should be handled with care and go through the same disclosure review processes as any result derived from confidential data.
In particular, we do not recommend using exact copies of select confidential data columns as your `start_data` in `tidysynthesis` (an approach sometimes known as partially synthetic data, as opposed to fully synthetic data). Not only are these data values potentially highly disclosive, but the `start_data` row ordering, row names, and other contextual formatting information can further increase disclosure risk.
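One safer pattern is to build `start_data` from an independent model of the first variable in the synthesis order rather than copying the confidential column. The sketch below uses assumed names (`conf`, `age`) and a simple resample-and-perturb step for illustration only; it is not a `tidysynthesis` function and carries no formal privacy guarantee on its own.

```r
# `conf` stands in for a confidential data frame; `age` for the first
# variable in the synthesis order (both names are assumptions)
conf <- data.frame(age = c(23, 35, 41, 52, 60, 29))
n <- nrow(conf)

start_data <- data.frame(
  # resample with replacement and perturb, so no confidential value or row
  # position is copied verbatim into the starting data
  age = sample(conf$age, size = n, replace = TRUE) +
    rnorm(n, mean = 0, sd = sd(conf$age) / 10)
)

# reset row names so they cannot index rows of the confidential data
rownames(start_data) <- NULL
```

Even with steps like these, the resulting `start_data` should still pass through disclosure review before release.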
Any descriptors of how synthetic data were generated, such as descriptions of data generating models or open-source code, should be subject to additional disclosure review before being released to recipients without access to the original confidential data.
Releasing code, documentation, or other materials that describe synthetic data generation processes can pose additional disclosure risk. Below we describe a few of the key vulnerabilities associated with synthetic data methodology.
Mathematically, model descriptions and parameters can themselves be disclosive if they encode sensitive information about the confidential data. For example, saturated log-linear models for contingency tables have parameters that indirectly encode the exact proportion of records with certain categorical features. Note that some data synthesis methodologies, such as differential privacy (DP), enable methodological transparency because publishing the methodology poses no additional disclosure risk. However, because `tidysynthesis` is designed to work with a broad slate of data synthesis methodologies, including but not limited to DP, it offers no such transparency guarantees.
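The log-linear case is easy to demonstrate in a few lines of R. The counts below are simulated stand-ins; the point is that a saturated model fits the observed table exactly, so sharing its parameters effectively shares the confidential cell counts.

```r
# toy contingency table with simulated counts
tab <- data.frame(
  sex   = factor(c("F", "F", "M", "M")),
  smoke = factor(c("no", "yes", "no", "yes")),
  count = c(40, 10, 30, 20)
)

# saturated Poisson log-linear model: one parameter per cell
fit <- glm(count ~ sex * smoke, family = poisson, data = tab)

# TRUE: the fitted values reproduce the observed counts exactly
all.equal(unname(fitted(fit)), tab$count)
```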
Similar issues arise when sharing code. In particular, the pseudo-random nature of typical samples generated from R implies that random number generator seeds act as keys that deterministically reproduce any potential randomness from sampling synthetic data models. Additionally, treating real-valued continuous numbers as discrete floating point approximations introduces further security vulnerabilities.
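The seed-as-key behavior is straightforward to demonstrate; the seed value below is arbitrary, and the same logic applies to any sampling step in a published synthesis script.

```r
# a published seed deterministically replays the "random" draws
set.seed(20240801)  # arbitrary example seed
draw1 <- rnorm(5)

set.seed(20240801)
draw2 <- rnorm(5)

identical(draw1, draw2)  # TRUE: anyone holding the seed can reproduce the draws
```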
Any evaluations of synthetic data utility or privacy risk should be subject to additional disclosure review before being released to synthetic data recipients without access to the original confidential data.
Just as descriptions of how synthetic data were generated can pose disclosure risk, evaluation metrics that rely on confidential data pose disclosure risks of their own. For example, comparing summary quantiles across the two datasets might reveal how far the synthetic maximum and minimum values fall from their confidential counterparts.
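As a toy illustration with simulated stand-in values, a released utility table of this form exposes the exact confidential minimum and maximum:

```r
conf_income  <- c(18000, 25000, 40000, 52000, 250000)  # simulated "confidential" values
synth_income <- c(20000, 27000, 39000, 50000, 90000)   # simulated synthetic values

# side-by-side quantile comparison of the kind found in utility reports;
# the confidential column reveals the true extremes verbatim
data.frame(
  statistic    = c("min", "median", "max"),
  confidential = quantile(conf_income, probs = c(0, 0.5, 1)),
  synthetic    = quantile(synth_income, probs = c(0, 0.5, 1))
)
```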
Additionally, it is critical to note that empirical disclosure risk metrics do not provide inherent guarantees about the protections afforded by synthetic data. Empirical risk metrics, like those implemented in `syntheval`, only describe what is possible to infer about individuals in the confidential data under their respective adversarial and data generating assumptions (i.e., what is known in advance about the confidential data and the synthetic data modeling and sampling processes).
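To make the role of these assumptions concrete, consider a deliberately simple, hypothetical risk metric (not `syntheval`'s implementation). It models only an adversary who checks for exact value matches, so it says nothing about near-matches, linkage, or attribute inference.

```r
# share of confidential values copied verbatim into the synthetic data;
# meaningful only under the exact-match adversarial assumption
copy_rate <- function(confidential, synthetic) {
  mean(confidential %in% synthetic)
}

copy_rate(c(18000, 25000, 40000), c(25000, 31000, 47000))  # 1/3 of values matched
```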
Disclosure risk avoidance is a spectrum, and every non-trivial output from `tidysynthesis` necessarily poses some disclosure risk.
`tidysynthesis` allows users to tune their data synthesis models to strike the right balance between fitness-for-use and disclosure risk. While there are numerous methods for measuring both utility and risk, no synthetic data method entirely eliminates this risk. The only guaranteed way to eliminate disclosure risks is to never release any results derived from the confidential data.
To further alleviate these unanticipated risks, we recommend pairing `tidysynthesis` outputs with other data access controls, as is common for public-use microdata. Here are some examples:
- Tiered access levels allow users to access increasingly granular and sensitive data in increasingly restrictive settings. For example, the All of Us Research Hub provides access to NIH genetic data in three tiers that restrict data access and use while increasing granularity (as of August 2024).
- Validation servers give users limited, controlled ability to compare results generated on synthetic data and confidential data to assess their validity. For example, the Safe Data Technologies project uses a validation server for validating regressions based on IRS income data (as of August 2024).