Introduction to Data Privacy

Why is Data Privacy important?

  • Modern computing and technology have made it easy to collect and process large amounts of data quickly.

  • But, with that computing power, malicious actors can now easily reidentify individuals by linking supposedly anonymized records with public databases.

  • This kind of attack is called a “record linkage” attack. The following are some examples of famous record linkage attacks.

    • In 1997, MA Gov. Bill Weld announced the public release of insurance data for researchers. He assured the public that PII had been deleted. A few days later Dr. Latanya Sweeney, then an MIT graduate student, mailed Weld’s own personal medical information to his office. She had purchased voter data and linked Weld’s birth date, gender, and ZIP code to his health records. And this was back in 1997, when computing power was minuscule and social media didn’t exist!

    • A study by Dr. Latanya Sweeney based on the 1990 Census (Sweeney 2000) found that 87% of the US population had reported characteristics that likely made them unique based only on ZIP code, gender, and date of birth.

  • At the same time, releasing granular data can be of immense value to researchers. For example, cell phone data are invaluable for emergency responses to natural disasters, and granular medical data will lead to better treatment and development of cures.

  • More granular data are also important for understanding equity, particularly for smaller populations and subgroups.
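Sweeney’s uniqueness finding can be illustrated in miniature: count how many records are unique on the quasi-identifier combination of ZIP code, gender, and date of birth. The records below are invented for illustration.

```python
from collections import Counter

# Toy records: (zip_code, gender, birth_date) quasi-identifiers.
# All values are made up for illustration.
records = [
    ("02138", "F", "1958-07-19"),
    ("02138", "M", "1945-03-02"),
    ("02139", "F", "1958-07-19"),
    ("02138", "M", "1945-03-02"),  # shares its combination with another record
]

counts = Counter(records)
unique = [r for r in records if counts[r] == 1]
share_unique = len(unique) / len(records)
print(f"{share_unique:.0%} of records are unique on (ZIP, gender, DOB)")
```

Any record that is unique on these three fields can, in principle, be re-identified by linking to an external dataset (such as voter rolls) that contains the same fields plus a name.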



What is Data Privacy?

  • Data Privacy: the right of individuals to have control over their sensitive information.

    • There are differing notions of what should and shouldn’t be private, which may include being able to opt out of privacy protections.
  • Data Privacy is a broad topic, which includes data security, encryption, access to data, etc.

    • We will not be covering privacy breaches from unauthorized access to a database (e.g., hackers).

    • We are instead focused on privacy preserving access to data.

  • Although data privacy and data confidentiality are certainly related, they are different, and both play a role in limiting statistical disclosure risk.

    • Privacy: the ability “to determine what information about ourselves we will share with others” (Fellegi 1972).

    • Confidentiality: “the agreement, explicit or implicit, between data subject and data collector regarding the extent to which access by others to personal information is allowed” (Fienberg and Jin 2018).

  • There is often a tension between privacy and data utility (or usefulness). This tension is referred to in the data privacy literature as the “privacy-utility trade-off.”

    • For example, some universities require students to install an app that tracks their movements on campus. This allows professors teaching large classes with 100+ students to know their students’ punctuality, tardiness, or class absences. This tracking can be invasive, especially for students who rarely leave campus except during holidays, because the university now has a comprehensive record of their daily lives. However, the tracking app could alert students about an active shooter on campus, identify safe buildings to seek refuge, and notify emergency contacts regarding the students’ safety.

    • Data utility, quality, accuracy, or usefulness: how practically useful or accurate the data are for research and analysis purposes.

    • Generally, higher utility means more privacy risk, and vice versa.

  • In the data privacy ecosystem there are the following stakeholders:

    • Data users and practitioners: individuals who consume the data, such as analysts, researchers, planners, and decision-makers.

    • Data privacy experts or researchers: individuals who specialize in developing data privacy and confidentiality methods.

    • Data curators, maintainers, or stewards (aka data owners): individuals who own the data and are responsible for its safekeeping.

    • Data intruders, attackers, or adversaries: individuals who try to gather sensitive information from the confidential data.

  • In addition, there are many versions of the data we should define:

    • Original dataset: the uncleaned, unprotected version of the data, such as the raw census microdata, which are never publicly released.

    • Confidential or gold standard dataset: the cleaned version (meaning edited for inaccuracies or inconsistencies) of the data; often referred to as the gold standard or actual data for analysis. For example, the Census Edited File that is the final confidential data for the 2020 Census. This dataset is never publicly released but may be made available to others who are sworn to protect confidentiality and who are provided access in a secure environment, such as a Federal Statistical Research Data Center.

    • Public dataset: the publicly released version of the confidential data, such as the US Census Bureau’s public tables and datasets.



Data Privacy Workflow

  • Data users have traditionally gained access to data via:

    1. direct access to the confidential data if they are trusted users (e.g., obtaining Special Sworn Status to use the Federal Statistical Research Data Centers).

    2. Access to public data or statistics, such as public microdata and summary tables, that the data curators and privacy experts produced with modification to protect confidentiality.

  • The latter is how most data users gain access to information from confidential data and what we will focus on for this course. To create public data or statistics, data curators rely on statistical disclosure control (SDC) or limitation (SDL) methods to preserve data confidentiality. The process of releasing this information publicly often involves the steps shown in figure 4.

  • Generally, releasing confidential data involves several steps to protect privacy.


Overview of SDC

  • Statistical Disclosure Control (SDC) or Statistical Disclosure Limitation (SDL) is a field of study that aims to develop methods for releasing high-quality data products while preserving data confidentiality as a means of maintaining privacy.

  • SDC methods have existed within statistics and the social sciences since the mid-twentieth century.

  • Below is an opinionated and incomplete overview of various SDC methods. For this set of training sessions, we will focus in-depth on the methods in yellow.

  • Definitions of a few methods we won’t cover in detail. See Matthews and Harel (2011) for more information.

    • Suppression: Not releasing data about certain subgroups.

    • Swapping: The exchange of sensitive values among sample units with similar characteristics.

    • Generalization: Aggregating variables into larger units (e.g., reporting state rather than zip code) or top/bottom coding (limiting values below/above a threshold to the threshold value).

    • Noise Infusion: Adding random noise, often to continuous variables which can maintain univariate distributions.

    • Sampling: Only releasing a sample of individual records.

  • The problem with the above approaches is that they can severely limit the utility of the data.

    • Mitra and Reiter (2006) found that a 5 percent swapping of 2 identifying variables in the 1987 Survey of Youth in Custody invalidated statistical hypothesis tests in regression.

    • Top/bottom coding eliminates information at the tails of the distributions, degrading analyses that depend on the entire distribution (Fuller 1993; Reiter, Wang, and Zhang, 2014).

  • A newer development in the SDC field with the potential to overcome some of these problems is Synthetic Data, which we will discuss further in the second half of this session.
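Three of the traditional methods above can be sketched in a few lines each. This is a minimal illustration with invented income values, not any agency’s production procedure.

```python
import random

random.seed(0)

# Hypothetical incomes; all values invented for illustration.
incomes = [18_000, 42_000, 55_000, 61_000, 250_000, 1_200_000]

# Top coding: report values above a threshold as the threshold,
# protecting outliers at the cost of the upper tail.
TOP = 200_000
top_coded = [min(x, TOP) for x in incomes]

# Noise infusion: add zero-mean random noise to each value, which
# roughly preserves univariate summaries such as the mean.
noisy = [x + random.gauss(0, 5_000) for x in incomes]

# Swapping: exchange a sensitive value between two records with
# similar characteristics (here, two adjacent incomes).
swapped = incomes.copy()
swapped[2], swapped[3] = swapped[3], swapped[2]
```

Note how top coding collapses the two largest incomes to the same value, which is exactly the loss of tail information that Fuller (1993) and Reiter, Wang, and Zhang (2014) warn about.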



Measuring Utility Metrics and Disclosure Risks

Disclosure Risks

  • Generally there are 3 kinds of disclosure risk:

    1. Identity disclosure risk: occurs if the data intruder associates a known individual with a public data record (e.g., a record linkage attack or when a data adversary combines one or more external data sources to identify individuals in the public data).

    2. Attribute disclosure risk: occurs if the data intruder determines new characteristics (or attributes) of an individual based on the information available through public data or statistics (e.g., if a dataset shows that all people age 50 or older in a city are on Medicaid, then the data adversary knows that any person in that city above age 50 is on Medicaid).

    3. Inferential disclosure risk: occurs if the data intruder predicts the value of some characteristic from an individual more accurately with the public data or statistic than would otherwise have been possible (e.g., if a public homeownership dataset reports a high correlation between the purchase price of a home and family income, a data adversary could infer another person’s income based on purchase price listed on Redfin or Zillow).

  • Important note: acceptable disclosure risks are usually determined by law.

Utility Measures

  • Generally there are 2 ways to measure utility of the data:

    1. General Utility (aka global utility): measures the univariate and multivariate distributional similarity between the confidential data and the public data (e.g., sample means, sample variances, and the variance-covariance matrix).

    2. Specific Utility (aka outcome specific utility): measures the similarity of results for a specific analysis (or analyses) of the confidential and public data (e.g., comparing the coefficients in regression models).

  • Higher utility = higher accuracy and usefulness of the data, so this is a key part of selecting an appropriate SDC method.
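Both kinds of utility measure can be sketched with invented confidential and synthetic samples. Here the “specific” analysis is a simple one-predictor regression whose slope is cov(x, y) / var(x); the data are made up for illustration.

```python
import statistics as st

# Hypothetical confidential and synthetic samples of two variables.
conf_x = [2.0, 3.5, 4.0, 5.5, 7.0]
conf_y = [1.1, 2.0, 2.3, 3.2, 4.1]
synth_x = [2.2, 3.3, 4.4, 5.2, 6.9]
synth_y = [1.0, 2.1, 2.2, 3.4, 3.9]

# General utility: compare summary statistics across the two files.
mean_gap = abs(st.mean(conf_x) - st.mean(synth_x))
var_gap = abs(st.variance(conf_x) - st.variance(synth_x))

# Specific utility: fit the same simple regression to each file and
# compare the estimated slopes (slope = cov(x, y) / var(x)).
def slope(x, y):
    mx, my = st.mean(x), st.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

coef_gap = abs(slope(conf_x, conf_y) - slope(synth_x, synth_y))
print(mean_gap, var_gap, coef_gap)
```

Small gaps on both fronts suggest the synthetic file preserves both the overall distribution and this particular analysis; a file can score well on one and poorly on the other.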



Exercise 1

Let’s say a researcher generates a synthetic version of a dataset on penguins species. The first 5 rows of the gold standard dataset looks like this:

species    bill_length_mm  sex
Chinstrap  51.3            male
Gentoo     44.0            female
Chinstrap  51.4            male
Chinstrap  45.4            female
Adelie     36.2            female


One of the metrics to assess data utility was the overall counts of penguin species across the synthetic and gold standard data, which look like this:

species    # conf. data  # synthetic data
Adelie     152           138
Chinstrap  68            68
Gentoo     124           116


Question 1: Would the above counts be considered a global utility metric or a specific utility metric and why?

Question 2: Researchers Mayer, Mutchler, and Mitchell (2016) looked at telephone metadata, which included the times, durations, and outgoing numbers of telephone calls. They found that one of the records in the data placed frequent calls to a local firearm dealer that prominently advertises a specialty in the AR semiautomatic rifle platform. The participant also placed lengthy calls to the customer support hotline for a major firearm manufacturer that produces a popular AR line of rifles. Using publicly available data, the researchers were able to confirm the participant’s identity and that he owned an AR-15. In this example, what kinds of disclosures happened? (Hint: there were two!)


Synthetic Data

Synthetic data consists of pseudo or “fake” records that are statistically representative of the confidential data. Records are considered synthesized when they are replaced with draws from a model fitted to the confidential data.

  • The goal of most synthesis is to closely mimic the underlying distribution and statistical properties of the real data to preserve data utility while minimizing disclosure risks.
  • Synthesized values also limit an intruder’s confidence, because they cannot confirm a synthetic value exists in the confidential dataset.
  • Synthetic data may be used as a “training dataset” to develop programs to run on confidential data via a validation server.

Partially synthetic data only synthesizes some of the variables in the released data (generally those most sensitive to disclosure). In partially synthetic data, there remains a one-to-one mapping between confidential records and synthetic records. Below, we see an example of what a partially synthesized version of the above confidential data could look like.

Fully synthetic data synthesizes all values in the dataset with imputed amounts. Fully synthetic data no longer directly map onto the confidential records, but remain statistically representative. Since fully synthetic data does not contain any actual observations, it protects against both attribute and identity disclosure. Below, we see an example of what a fully synthesized version of the confidential data shown above could look like.
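Using the penguin rows from the exercise above, the two approaches can be sketched as follows. The synthesis models here (a normal draw for bill length, simple resampling for the categorical variables) are deliberately simplistic stand-ins for real fitted models.

```python
import random

random.seed(1)

# Confidential rows (from the penguins example above).
conf = [
    {"species": "Chinstrap", "bill_length_mm": 51.3, "sex": "male"},
    {"species": "Gentoo",    "bill_length_mm": 44.0, "sex": "female"},
    {"species": "Chinstrap", "bill_length_mm": 51.4, "sex": "male"},
    {"species": "Chinstrap", "bill_length_mm": 45.4, "sex": "female"},
    {"species": "Adelie",    "bill_length_mm": 36.2, "sex": "female"},
]

lengths = [r["bill_length_mm"] for r in conf]
mu = sum(lengths) / len(lengths)

# Partially synthetic: only the sensitive variable (bill length) is
# replaced with model draws; species and sex keep their confidential
# values, so each synthetic row still maps one-to-one to a
# confidential row.
partial = [dict(r, bill_length_mm=round(random.gauss(mu, 5.0), 1)) for r in conf]

# Fully synthetic: every value is a draw from a model fitted to the
# confidential data (here, resampling the categoricals and drawing
# bill length from a fitted normal).
species_pool = [r["species"] for r in conf]
sex_pool = [r["sex"] for r in conf]
full = [
    {
        "species": random.choice(species_pool),
        "bill_length_mm": round(random.gauss(mu, 5.0), 1),
        "sex": random.choice(sex_pool),
    }
    for _ in conf
]
```

In the partially synthetic file the unsynthesized columns are unchanged, which is why a one-to-one mapping to confidential records remains; in the fully synthetic file no row corresponds to any actual penguin.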



Synthetic Data <-> Imputation Connection

  • Multiple imputation was originally developed to address non-response problems in surveys (Rubin 1977).
  • Statisticians created new observations or values to replace the missing data by developing a model based on other available respondent information.
  • This process of replacing missing data with substituted values is called imputation.

Imputation Example

Imagine you are running a conference with 80 attendees. You are collecting names and ages of all your attendees. Unfortunately, when the conference is over, you realize that only about half of the attendees listed their ages. One common imputation technique is to just replace the missing values with the mean age of those in the data.

Shown below is the distribution of the 40 age observations that are not missing.

And after imputation, the histogram looks like this:

  • Using the mean to impute the missing ages removes useful variation and conceals information from the “tails” of the distribution.
  • Simply put, we used a straightforward model (replace the data with the mean) and sampled from that model to fill in the missing values.
  • When creating synthetic data, this process is repeated for an entire variable, or set of variables.
  • In a sense, the entire column is treated as missing!
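The mean-imputation step described above is a one-liner. The ages below are invented for illustration, with None marking a missing response.

```python
import statistics as st

# Hypothetical attendee ages; None marks a missing response.
ages = [24, 31, None, 45, None, 52, 38, None, 29, 61]

observed = [a for a in ages if a is not None]
mean_age = st.mean(observed)

# Mean imputation: replace every missing value with the observed mean.
# Simple, but it shrinks the variance and hides the tails.
imputed = [a if a is not None else mean_age for a in ages]

print(st.pstdev(imputed) < st.pstdev(observed))  # variation is lost
```

Every imputed value lands exactly at the mean, which is why the post-imputation histogram shows a spike at the center and why the standard deviation shrinks.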



Sequential Synthesis

A more advanced implementation of synthetic data generation estimates a model for each variable, using previously synthesized variables as predictors. This iterative process is called sequential synthesis. It allows us to model multivariate relationships (or joint distributions) without being computationally expensive.

The process described above may be easier to understand with the following table:

Step  Outcome                   Modelled with  Predicted with
1     Sex                       Random sampling with replacement
2     Age                       Sex            Sampled Sex
3     Social Security Benefits  Sex, Age       Sampled Sex, Sampled Age



  • We can select the synthesis order based on the priority of the variables or the relationships between them.
  • Usually, the earlier a variable appears in the synthesis order, the better its original information is preserved in the synthetic data.
  • Bowen, Liu, and Su (2021) proposed a method that ranks variable importance by either practical or statistical utility and sequentially synthesizes the data accordingly.
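The three steps in the table above can be sketched as follows. Drawing ages from same-sex confidential records is a simple nonparametric stand-in for a fitted model; the records are invented for illustration.

```python
import random

random.seed(2)

# Hypothetical confidential records: (sex, age).
conf = [("F", 67), ("M", 71), ("F", 64), ("M", 69), ("F", 70)]

# Step 1: synthesize sex by random sampling with replacement.
synth_sex = [random.choice([s for s, _ in conf]) for _ in conf]

# Step 2: synthesize age conditional on the *sampled* sex by drawing
# from the confidential ages of records with that sex (a crude
# stand-in for a model of age given sex).
def ages_for(sex):
    return [a for s, a in conf if s == sex]

synth_age = [random.choice(ages_for(s)) for s in synth_sex]

synthetic = list(zip(synth_sex, synth_age))
```

A third step would model benefits on confidential sex and age, then predict with the sampled sex and sampled age, continuing the same pattern.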



Parametric vs. Nonparametric Data Generation Process

Parametric data synthesis is the process of data generation based on a parametric distribution or generative model.

  • Parametric models assume a finite number of parameters that capture the complexity of the data.

  • They are generally less flexible, but more interpretable than nonparametric models.

  • Examples: regression to assign an age variable, sampling from a probability distribution, Bayesian models, and copula-based models.

Nonparametric data synthesis is the process of data generation that is not based on assumptions about an underlying distribution or model.

  • Often, nonparametric methods use frequency proportions or marginal probabilities as weights for some type of sampling scheme.

  • They are generally more flexible, but less interpretable than parametric models.

  • Examples: assigning gender based on underlying proportions, CART (Classification and Regression Trees) models, RNN models, etc.

Important: Synthetic data are only as good as the models used for imputation!
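The parametric/nonparametric contrast can be shown side by side: fit a two-parameter normal and sample from it, versus sampling a categorical variable directly from its empirical distribution. All data are invented for illustration.

```python
import random
import statistics as st

random.seed(3)

conf_ages = [34, 29, 41, 55, 38, 47, 31, 60]
conf_gender = ["F", "F", "M", "F", "M", "M", "F", "M"]

# Parametric: assume ages are normally distributed, estimate the two
# parameters, and sample new ages from the fitted distribution.
mu, sigma = st.mean(conf_ages), st.stdev(conf_ages)
synth_ages = [random.gauss(mu, sigma) for _ in conf_ages]

# Nonparametric: no distributional assumption; sampling uniformly
# with replacement from the observed values reproduces the empirical
# proportions in expectation.
synth_gender = random.choices(conf_gender, k=len(conf_gender))
```

If the ages are not actually normal (say, bimodal), the parametric draws will miss that structure while the nonparametric resampling will not, which is the sense in which nonparametric methods are more flexible.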



Implicates

  • Researchers can create any number of versions of a partially synthetic or fully synthetic dataset. Each version of the dataset is called an implicate; implicates are also referred to as replicates or simply “synthetic datasets.”

    • For partially synthetic data, non-synthesized variables are the same across each version of the dataset.
  • Multiple implicates are useful for understanding the uncertainty added by imputation and are required for calculating valid standard errors.

  • More than one implicate can be released for public use; each new release, however, increases disclosure risk (but allows for more complete analysis and better inferences, provided users use the correct combining rules).

  • Implicates can also be analyzed internally to find which version(s) of the dataset provide the most utility in terms of data quality.
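As one concrete illustration of combining rules, the rules proposed by Reiter (2003) for partially synthetic data (a reference beyond those listed here) average the point estimates across implicates and combine within- and between-implicate variability. The estimates and variances below are invented.

```python
import statistics as st

# Point estimates (e.g., a mean) and their sampling variances from
# m = 4 implicates; numbers are made up for illustration.
estimates = [10.2, 9.8, 10.5, 10.1]
variances = [0.40, 0.38, 0.45, 0.41]

m = len(estimates)
q_bar = st.mean(estimates)   # combined point estimate
b = st.variance(estimates)   # between-implicate variance
u_bar = st.mean(variances)   # average within-implicate variance

# Combining rule for partially synthetic data (Reiter 2003):
# total variance = u_bar + b / m.
T = u_bar + b / m
print(q_bar, T)
```

Note that the between-implicate term shrinks as more implicates are released, which is part of why additional implicates improve inference even as they raise disclosure risk.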



Exercise 2

Sequential Synthesis

Question 1

You have a confidential dataset that contains information about dogs’ weight and their height. You decide to sequentially synthesize these two variables and write up your method below. Can you spot the mistake in writing up your method?

To create a synthetic record, first synthetic pet weight is assigned based on a random draw from a normal distribution with mean equal to the average of confidential weights, and standard deviation equal to the standard deviation of confidential weights. Then the confidential height is regressed on the synthetic weight. Using the resulting regression coefficients, a synthetic height variable is generated for each row in the data using just the synthetic weight values as an input.



Question 1 Notes

You have a confidential dataset that contains information about dogs’ weight and their height. You decide to sequentially synthesize these two variables and write up your method below. Can you spot the mistake in writing up your method?

To create a synthetic record, first synthetic pet weight is assigned based on a random draw from a normal distribution with mean equal to the average of confidential weights, and standard deviation equal to the standard deviation of confidential weights. Then the confidential height is regressed on the synthetic weight. Using the resulting regression coefficients, a synthetic height variable is generated for each row in the data using just the synthetic weight values as an input.

  • Height should be regressed on the confidential values for weight, rather than the synthetic values for weight
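The corrected procedure can be sketched in code: fit the regression on confidential weight and height, then predict synthetic heights from the synthetic weights. The dog measurements are invented for illustration.

```python
import random
import statistics as st

random.seed(4)

# Hypothetical confidential data: dogs' weights (kg) and heights (cm).
weight = [8.0, 12.5, 20.0, 25.5, 31.0]
height = [28.0, 35.0, 47.0, 55.0, 60.0]

# Step 1: synthesize weight from a normal distribution fitted to the
# confidential weights.
mu_w, sd_w = st.mean(weight), st.stdev(weight)
synth_weight = [random.gauss(mu_w, sd_w) for _ in weight]

# Step 2 (corrected): regress CONFIDENTIAL height on CONFIDENTIAL
# weight, then use the fitted coefficients to predict a synthetic
# height from each synthetic weight.
mw, mh = st.mean(weight), st.mean(height)
slope = (
    sum((w - mw) * (h - mh) for w, h in zip(weight, height))
    / sum((w - mw) ** 2 for w in weight)
)
intercept = mh - slope * mw
synth_height = [intercept + slope * w for w in synth_weight]
```

The mistake in the write-up was fitting the regression to the synthetic weights; the model must be estimated on confidential values, with synthetic values used only at prediction time.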



Suggested Reading - General Data Privacy

  • Matthews, G. J., & Harel, O. (2011). Data confidentiality: A review of methods for statistical disclosure limitation and methods for assessing privacy. Statistics Surveys, 5, 1-29. link

  • Bowen, CMK., (2021). “Personal Privacy and the Public Good: Balancing Data Privacy and Data Utility.” Urban Institute. link

  • Benschop, T. and Welch, M. (n.d.) Statistical Disclosure Control for Microdata: A Practice Guide. Retrieved (insert date), from https://sdcpractice.readthedocs.io/en/latest/


Suggested Reading - Synthetic Data

Snoke, J., Raab, G. M., Nowok, B., Dibben, C., & Slavkovic, A. (2018). General and specific utility measures for synthetic data. Journal of the Royal Statistical Society: Series A (Statistics in Society), 181(3), 663-688. link

Bowen, C. M., Bryant, V., Burman, L., Czajka, J., Khitatrakun, S., MacDonald, G., … & Zwiefel, N. (2022). Synthetic Individual Income Tax Data: Methodology, Utility, and Privacy Implications. In International Conference on Privacy in Statistical Databases (pp. 191-204). Springer, Cham. link

Raghunathan, T. E. (2021). Synthetic data. Annual Review of Statistics and Its Application, 8, 129-140. link



References

Bowen, Claire McKay, Fang Liu, and Bingyue Su. 2021. “Differentially Private Data Release via Statistical Election to Partition Sequentially.” Metron 79 (1): 1–31.
Fellegi, Ivan P. 1972. “On the Question of Statistical Confidentiality.” Journal of the American Statistical Association 67 (337): 7–18.
Fienberg, Stephen E, and Jiashun Jin. 2018. “Statistical Disclosure Limitation for Data Access.” In Encyclopedia of Database Systems (2nd Ed.).
Matthews, Gregory J, and Ofer Harel. 2011. “Data Confidentiality: A Review of Methods for Statistical Disclosure Limitation and Methods for Assessing Privacy.” Statistics Surveys 5: 1–29.
Mayer, Jonathan, Patrick Mutchler, and John C Mitchell. 2016. “Evaluating the Privacy Properties of Telephone Metadata.” Proceedings of the National Academy of Sciences 113 (20): 5536–41.
Rubin, Donald B. 1977. “Formalizing Subjective Notions about the Effect of Nonrespondents in Sample Surveys.” Journal of the American Statistical Association 72 (359): 538–43.
Sweeney, Latanya. 2000. “Simple Demographics Often Identify People Uniquely.” Health (San Francisco) 671 (2000): 1–34.