options(scipen = 999)
library(tidyverse)
library(urbnthemes)
library(smoothmest)
set_urbn_defaults()
Differential privacy uses the concept of a privacy loss budget, typically represented mathematically as \(\epsilon\). The privacy loss budget bounds the privacy risk associated with releasing data or query results (Census Bureau 2022).
(Note: \(\epsilon\) is not the only privacy loss parameter, but we will use it here as a general representation of the privacy loss budget.)
Differential Privacy (Dwork, McSherry, et al. 2006): A sanitization algorithm, \(\mathcal{M}\), satisfies \(\epsilon\)-DP if for all subsets \(S\subseteq Range(\mathcal{M})\) and for all \(X,X'\) such that \(d(X,X')=1\),
\[\frac{\Pr(\mathcal{M}( X) \in S)}{ \Pr(\mathcal{M}( X')\in S)}\le \exp(\epsilon)\]
where \(\epsilon>0\) is the privacy loss budget and \(d(X,X')=1\) represents the possible ways that \(X'\) differs from \(X\) by one record.
Global Sensitivity is a term which describes how resistant the differentially private sanitizer is to the presence of outliers (Bowen and Garfinkel 2021). It is quantified by how much an output can change by the addition or removal of the most extreme possible record that could exist in the population (regardless of whether that record is actually present in the data).
\(l_1\)-Global Sensitivity (Dwork, McSherry, et al. 2006): The maximum amount a statistic can change in absolute value terms with the addition or removal of the most extreme possible observation.
\(l_1\)-Global Sensitivity (Dwork, McSherry, et al. 2006): For all \(X,X'\) such that \(d(X,X')=1\), the global sensitivity of a function \(M\) is
\[\Delta_1 (M)= \underset{d(X,X')=1}{\text{sup}} \|M(X)-M(X') \|_1\]
For scalars, the \(l_1\)-Global Sensitivity is \(|M(X) - M(X')|\).
\(l_2\)-Global Sensitivity (Dwork, McSherry, et al. 2006): The maximum amount a statistic can change with the addition or removal of the most extreme possible observation. In this case, the statistic is squared, summed, and rooted.
\(l_2\)-Global Sensitivity (Dwork, McSherry, et al. 2006): For all \(X,X'\) such that \(d(X,X')=1\), the global sensitivity of a function \(M\) is
\[\Delta_2 (M)= \underset{d(X,X')=1}{\text{sup}} \|M(X)-M(X') \|_2\]
For scalars, the \(l_2\)-Global Sensitivity is \(\sqrt{(M(X) - M(X'))^2}\).
Suppose we are interested in counting the number of sole proprietorships in Washington, DC. What are the \(l_1\) and \(l_2\) global sensitivities of this statistic?
In other words, what is the maximum difference between \(M(X)\) and \(M(X')\) when \(d(X,X')=1\)?
The answer is one. The most a count can change by adding or subtracting one observation is one!
\(\Delta_1 (M) = \Delta_2 (M) = 1\)
Suppose we are interested in calculating the total income of sole proprietorships in Washington, DC. What are the \(l_1\) and \(l_2\) global sensitivities of this statistic?
In other words, what is the maximum difference between \(M(X)\) and \(M(X')\) when \(d(X,X')=1\)?
The answer is \(\infty\). A total can theoretically change by any amount with the addition or deletion of one observation.
Counts are the best explored statistics in differential privacy. With unbounded differential privacy, the global sensitivity of a count is always 1.
For example, assume we are counting the number of billionaires in the United States. The most the count can change with the addition or removal of Jeff Bezos is one.
Calculating the global sensitivity is more difficult for other statistics than counts. The global sensitivity of a sum is unbounded because the addition or removal of one row can theoretically change the sum by any amount.
One approach is to clip or truncate values. If we assume that all observations are between 6 and 10, inclusive, then we can treat the global sensitivity as \(10 - 6 = 4\).
Means can be rewritten as two queries: a total divided by a count.
Sometimes the number of observations is assumed to be known. In this case, the global sensitivity is smaller.
A sanitizer protects against disclosure risk. A differentially private sanitizer protects against disclosure risk and meets the definition of differential privacy. If we know the global sensitivity of statistics, then we can often add noise in a way that sanitizers satisfy differential privacy.
The Laplace sanitizer satisfies \(\epsilon\)-DP by adding noise from a Laplace distribution to statistics. More sensitivity means more expected noise. More \(\epsilon\) means less expected noise.
Laplace Mechanism (Dwork, McSherry, et al. 2006): The Laplace Mechanism satisfies \(\epsilon\)-DP by adding noise \((\eta_1,\ldots,\eta_k)\) to \(M\) that are independently drawn from a Laplace distribution with the location parameter at 0 and scale parameter of \(\frac{\Delta_1(M)}{\epsilon}\) such that
\[\mathcal{M}(X)=M(X)+(\eta_1,\ldots,\eta_k)\]
The Laplace sanitizer uses \(l_1\)-Global Sensitivity.
Let’s consider a simple example with the Palmer Penguins data set. The data set contains 333 observations about Adelie, Chinstrap, and Gentoo penguins in Antarctica. Suppose we want to count how many penguins are Adelie penguins.
penguins <- palmerpenguins::penguins %>%
drop_na()
penguins
## # A tibble: 333 x 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen 36.7 19.3 193 3450
## 5 Adelie Torgersen 39.3 20.6 190 3650
## 6 Adelie Torgersen 38.9 17.8 181 3625
## 7 Adelie Torgersen 39.2 19.6 195 4675
## 8 Adelie Torgersen 41.1 17.6 182 3200
## 9 Adelie Torgersen 38.6 21.2 191 3800
## 10 Adelie Torgersen 34.6 21.1 198 4400
## # i 323 more rows
## # i 2 more variables: sex <fct>, year <int>
The global sensitivity is \(\frac{\Delta_1(M)}{\epsilon} = \frac{1}{\epsilon}\). This means our differentially private statistic is one draw from a Laplace distribution with center at the confidential statistics and scale parameter equal to \(\frac{1}{\epsilon}\).
We’ll use the laplace_sanitizer()
function from week
4:
# function to draw Laplace noise for one statistic
laplace_sanitizer <- function(sensitivity, epsilon, n = 1) {
# lambda (distribution width) is sensitivity/privacy loss budget
l <- sensitivity / epsilon
# draw from Laplace distribution
noise <- smoothmest::rdoublex(n = n, # draw one observation (adding noise to one statistic)
mu = 0, # centered at 0
lambda = l) # scale based on l calculated above
return(noise)
}
Let’s calculate the statistic without any noise.
answer_conf <- sum(penguins$species == "Adelie")
answer_conf
## [1] 146
Now, let’s calculate the statistic with noise that satisfies the definition of \(\epsilon\)-differential privacy.
set.seed(1)
answer_dp <- answer_conf + laplace_sanitizer(sensitivity = 1, epsilon = 0.1)
answer_dp
## [1] 153.5518
Maybe we got a lucky or unlucky draw from the Laplace distribution. Let’s calculate this statistic 100 times to understand the distribution of noisy statistics. This is purely for demonstration to understand the expectation of the noisy statistic.
The Gaussian sanitizer satisfies \(\epsilon\)-DP by adding noise from a Gaussain distribution (also known as Normal distribution or bell curve) to statistics. More sensitivity means more expected noise. More \(\epsilon\) means less expected noise.
Gaussian Mechanism (Dwork and Roth 2014): The Gaussian Mechanism satisfies \((\epsilon,\delta)\)-DP by adding Gaussian noise with zero mean and variance, \(\sigma^2\), such that
\[\mathcal{M}(X)=M(X)+(\eta_1,\ldots,\eta_k)\]
where \(\eta_1,\ldots,\eta_k\) are independently drawn and \(\sigma=\frac{\Delta_2(M)\sqrt{2 \log(1.25/\delta)}}{\epsilon}\).
This sanitizer includes two parameters: \(\epsilon\) and \(\delta\). We can think of \(\delta\) as a small probability that the bound created by \(\epsilon\) does not hold.
The Gaussian sanitizer uses \(l_2\)-Global Sensitivity.
# function to draw Laplace noise for one statistic
gaussian_sanitizer <- function(sensitivity, epsilon, delta) {
# lambda (distribution width) is sensitivity/privacy loss budget
sigma <- (sensitivity * sqrt(2 * log(1.25 / delta))) / epsilon
# draw from Laplace distribution
noise <- rnorm(n = 1, # draw one observation (adding noise to one statistic)
mean = 0,
sd = sigma) # scale based on l calculated above
return(noise)
}
Let’s calculate the statistic without any noise.
answer_conf <- sum(penguins$species == "Adelie")
answer_conf
## [1] 146
Now, let’s calculate the statistic with noise that satisfies the definition of \((\epsilon, \delta)\)-differential privacy.
set.seed(1)
answer_dp <- answer_conf + gaussian_sanitizer(sensitivity = 1, epsilon = 0.1, delta = 10^-7)
answer_dp
## [1] 110.1865
Maybe we got a lucky or unlucky draw from the Normal distribution. Let’s calculate this statistic 100 times to understand the distribution of noisy statistics. This is purely for demonstration to understand the expectation of the noisy statistic.
The Gaussian sanitizer is worse than the Laplace sanitizer! So why do we even need a Gaussian sanitizer?
The Gaussian sanitizer can compose better for multiple queries. This is because the sum of two normally distributed random variables is normally distributed but the sum of two Laplacian distributed variables is not Laplacian.
Exponential Mechanism (F. McSherry and Talwar 2007): The Exponential Mechanism releases values with a probability proportional to
\[\left(\frac{\epsilon u(X, \theta)}{2\Delta_1(u)}\right)\]
and satisfies \(\epsilon\)-DP, where \(u(X,\theta)\) is the score or quality function that determines the values for each possible output, \(\theta\), on \(X\).
The Exponential sanitizer uses \(l_1\)-Global Sensitivity.
The sequential composition theorem (F. D. McSherry 2009; Dwork and Rothblum 2016; Bun and Steinke 2016) allows us to calculate multiple noisy statistics under the same privacy budget.
Suppose a mechanism, \(\mathcal{M}_j\), provides \(\epsilon_j\)-DP. The sequence of \(\mathcal{M}_j(X)\) applied on the same \(X\) provides \(\sum_{j=1}^J\epsilon_j\)-DP.
Let’s return the penguins example from above. Suppose \(\epsilon = 1\) and we want to count the number of “Adelie” penguins and the number of “Chinstrap” penguins.
epsilon <- 1
set.seed(20220505)
sum(penguins$species == "Adelie") + laplace_sanitizer(sensitivity = 1, epsilon = epsilon / 2)
## [1] 146.7052
sum(penguins$species == "Chinstrap") + laplace_sanitizer(sensitivity = 1, epsilon = epsilon / 2)
## [1] 78.36747
For reference, let’s look at the truth.
sum(penguins$species == "Adelie")
## [1] 146
sum(penguins$species == "Chinstrap")
## [1] 68
The parallel composition theorem (F. D. McSherry 2009; Dwork and Rothblum 2016; Bun and Steinke 2016) allows us to conserve \(\epsilon\) (or increase accuracy) when statistics come from partitions of the data (for example, different states).
Let \(D_j\) be disjoint subsets of the input domain \(D\). The sequence of \(\mathcal{M}_j(X\cap D_j)\) provides \(\max_{j \in \{1,\ldots,J\}} \epsilon_j\)-DP
Let’s consider a larger data set with 53,940 observations about diamonds. Suppose we want to calculate a differenitally private histogram of diamond sizes with bins [0, 1], (1, 2], (2, 3], (3, 4], (4, 5], and (5,6] with \(\epsilon = 0.1\).
diamonds_conf <- count(diamonds, carat = ceiling(carat))
diamonds_conf
## # A tibble: 6 x 2
## carat n
## <dbl> <int>
## 1 1 36438
## 2 2 15613
## 3 3 1857
## 4 4 27
## 5 5 4
## 6 6 1
One approach is to use \(\frac{\epsilon = 0.1}{6}\) for each of the six counting queries. This is based on sequential composition.
epsilon <- 0.1
set.seed(10)
diamonds_conf <- bind_cols(
diamonds_conf,
sequential = diamonds_conf$n + laplace_sanitizer(sensitivity = 1, epsilon = epsilon / 6, n = 6)
)
diamonds_conf
## # A tibble: 6 x 3
## carat n sequential
## <dbl> <int> <dbl>
## 1 1 36438 36437.
## 2 2 15613 15558.
## 3 3 1857 1902.
## 4 4 27 -67.5
## 5 5 4 17.9
## 6 6 1 66.2
The bins for carat
partition the data set and each bin
is a disjoint subset of the data. Therefore, we can use parallel
composition and get more accurate differentially private counts!
set.seed(11)
diamonds_conf <- bind_cols(
diamonds_conf,
parallel = diamonds_conf$n + laplace_sanitizer(sensitivity = 1, epsilon = epsilon, n = 6)
)
diamonds_conf
## # A tibble: 6 x 4
## carat n sequential parallel
## <dbl> <int> <dbl> <dbl>
## 1 1 36438 36437. 36446.
## 2 2 15613 15558. 15683.
## 3 3 1857 1902. 1857.
## 4 4 27 -67.5 69.0
## 5 5 4 17.9 28.6
## 6 6 1 66.2 9.53
The Post-Processing Theorem (Bun and Steinke 2016; Dwork, McSherry, et al. 2006; Nissim, Raskhodnikova, and Smith 2007) allows us to use differentially private information in any way we see fit without releasing further information.
If \(\mathcal{M}\) is a sanitizer that satisfies \(\epsilon\)-DP and \(g\) is any function, then \(g\left(\mathcal{M}(X)\right)\) also satisfies \(\epsilon\)-DP.
Rounding and eliminating impossible values like negative counts are common types of post-processing. There are also types of post-processing that can improve accuracy by leveraging information calculated from the same data set.
Consider a simulated data set with information about small businesses (0-20 employees) in Texas and Vermont.
set.seed(20220509)
small_businesses <- bind_rows(
Texas = tibble(
employees = rpois(n = 100010, lambda = 10),
income = rlnorm(n = 100010, meanlog = 10, sdlog = 2)
),
Vermont = tibble(
employees = rpois(n = 403, lambda = 10),
income = rlnorm(n = 403, meanlog = 10, sdlog = 2)
),
.id = "state"
) %>%
mutate(employees = if_else(employees > 20, 20L, employees))
Using the Laplace sanitizer, calculate the number of small businesses in Texas and Vermont (count) with the overall \(\epsilon = 0.1\). Use the parallel composition theorem.
ex3_conf <- count(small_businesses, state)
ex3_conf
set.seed(46)
bind_cols(
ex3_conf,
ex3_conf$n + laplace_sanitizer(
sensitivity = ### ______,
epsilon = ### ______,
n = 2
)
)
The observations from Texas and Vermont are disjoint, so we can use the full \(\epsilon = 0.1\) for each statistics instead of splitting it across the statistics.
ex3_conf <- count(small_businesses, state)
ex3_conf
## # A tibble: 2 x 2
## state n
## <chr> <int>
## 1 Texas 100010
## 2 Vermont 403
set.seed(46)
bind_cols(
ex3_conf,
n_dp = ex3_conf$n + laplace_sanitizer(
sensitivity = 1,
epsilon = 0.1,
n = 2
)
)
## # A tibble: 2 x 3
## state n n_dp
## <chr> <int> <dbl>
## 1 Texas 100010 100029.
## 2 Vermont 403 418.
The absolute error is larger for Texas, but the relative error is much bigger for Vermont.
Using the Laplace sanitizer, calculate the number of employees in the entire data set (sum) with the overall \(\epsilon = 0.1\). We know from auxiliary information that the number of employees varies from 0 to 20 because they are small businesses.
ex4_conf <- small_businesses %>%
summarize(employees = sum(employees))
set.seed(47)
bind_cols(
ex4_conf,
employees_dp = ex4_conf$employees + laplace_sanitizer(
sensitivity = ### ______,
epsilon = ### ______,
n = 1
)
)
ex4_conf <- small_businesses %>%
summarize(employees = sum(employees))
set.seed(47)
bind_cols(
ex4_conf,
employees_dp = ex4_conf$employees + laplace_sanitizer(
sensitivity = 20,
epsilon = 0.1,
n = 1
)
)
## # A tibble: 1 x 2
## employees employees_dp
## <int> <dbl>
## 1 1003755 1003807.
Approximate Differential Privacy, also known as \((\epsilon, \delta)\)-Differential Privacy is a relxation of \(\epsilon\)-Differential Privacy. We saw this definition above with the Gaussian sanitizer.
\((\epsilon, \delta)\)-Differential Privacy (Dwork, Kenthapadi, et al. 2006): A sanitization algorithm, \(\mathcal{M}\), satisfies \((\epsilon, \delta)\)-DP if for all \(X, X'\) that are \(d(X,X')=1\),
\[\Pr(\mathcal{M}( X) \in S)\le \exp(\epsilon) \Pr(\mathcal{M}( X')\in S) + \delta\]
where \(\delta\in [0,1]\).
We can think of \(\delta\) as a small probability that the bound created by \(\epsilon\) does not hold. \(\epsilon\)-DP is a special case of \((\epsilon, \delta)\)-DP when \(\delta=0\).
Zero-Concentrated Differential Privacy is another relaxation of \(\epsilon\)-Differential Privacy. This definition is used by the Census Bureau for the 2020 Decennial Census.
Zero-Concentrated Differential Privacy (Bun and Steinke 2016): A sanitization algorithm, \(\mathcal{M}\), satisfies \((\xi, \rho)\)-zero-concentrated differential privacy if for all \(X, X'\) that are \(d(X,X')=1\) and \(\alpha\in (1, \infty)\),
\[D_\alpha(\mathcal{M}(X)||\mathcal{M}(X'))\leq\xi+\rho\alpha\]
where \(D_\alpha(\mathcal{M}(X)||\mathcal{M}(X'))\) is the \(\alpha\)-R'enyi divergence % between the distribution of \(\mathcal{M}(X)\) and the distribution of \(\mathcal{M}(X')\), \(\xi\) and \(\rho\) are positive constants, and \(\alpha \in (1,\infty)\).
Zero-Concentrated Differential Privacy, also known as R'enyi Differential Privacy, only holds for counts.
Differential privacy states that the log of the ratio of the probability that any individual observation was in the data that generated the output vs. not in the data that generated the output is bounded by the value of \(\epsilon\).
\[\frac{\Pr(\mathcal{M}( X) \in S)}{ \Pr(\mathcal{M}( X')\in S)}\le \exp(\epsilon)\]
The bound is in exponential units, so modest increases in \(\epsilon\) correspond with large increases in the ratio of the probabilities.
Early differential privacy researchers thought \(\epsilon = 1\) or \(\epsilon = 2\) were upper bounds on \(\epsilon\). Today, much higher values of \(\epsilon\) are used. The April 2021 Decennial Census demonstration data used \(\epsilon = 4.0\) and \(\epsilon = 10.3\) for the person-level file. The Decennial Census ended up using \(\epsilon = 17.14\) for the person-level file.
Let’s consider the ratios fo the probabilities for different values of \(\epsilon\):
## # A tibble: 10 x 2
## epsilon ratio
## <dbl> <dbl>
## 1 0.25 1
## 2 0.5 2
## 3 0.75 2
## 4 1 3
## 5 2 7
## 6 4 55
## 7 6 403
## 8 8 2981
## 9 10.3 29733
## 10 17.1 27784809
It is tough to reason what a ratio of 27784809 even means.