options(scipen = 999)

library(tidyverse)
library(urbnthemes)
library(smoothmest)

set_urbn_defaults()

Recap

Privacy loss budget

Differential privacy uses the concept of a privacy loss budget, typically represented mathematically as \(\epsilon\). The privacy loss budget bounds the privacy risk associated with releasing data or query results (Census Bureau 2022).

(Note: \(\epsilon\) is not the only privacy loss parameter, but we will use it here as a general representation of the privacy loss budget.)

  • A larger value of \(\epsilon\) increases the maximum disclosure risk (the upper bound of the disclosure risk) associated with a given release of information.
    • larger \(\epsilon\) = less noise added to data = more accuracy, but less privacy
    • smaller \(\epsilon\) = more noise added to data = less accuracy, but more privacy
  • Extreme cases (these are not realistic in real-world applications, but they demonstrate the intuition):
    • \(\epsilon \to \infty\)
      • all privacy will be lost; data retains all utility, but no privacy
    • \(\epsilon \to 0\)
      • no privacy is lost; data is completely distorted and no utility remains

\(\epsilon\)-Differential privacy

Differential Privacy (Dwork, McSherry, et al. 2006): A sanitization algorithm, \(\mathcal{M}\), satisfies \(\epsilon\)-DP if for all subsets \(S\subseteq Range(\mathcal{M})\) and for all \(X,X'\) such that \(d(X,X')=1\),

\[\frac{\Pr(\mathcal{M}( X) \in S)}{ \Pr(\mathcal{M}( X')\in S)}\le \exp(\epsilon)\]

where \(\epsilon>0\) is the privacy loss budget and \(d(X,X')=1\) represents the possible ways that \(X'\) differs from \(X\) by one record.




Global Sensitivity

Global sensitivity describes how much a statistic can be affected by the most extreme possible record, and therefore how much noise a differentially private sanitizer must add to protect it (Bowen and Garfinkel 2021). It is quantified as how much an output can change with the addition or removal of the most extreme possible record that could exist in the population (regardless of whether that record is actually present in the data).


\(l_1\)-Global Sensitivity

Conceptual

\(l_1\)-Global Sensitivity (Dwork, McSherry, et al. 2006): The maximum amount a statistic can change in absolute value terms with the addition or removal of the most extreme possible observation.


Technical

\(l_1\)-Global Sensitivity (Dwork, McSherry, et al. 2006): For all \(X,X'\) such that \(d(X,X')=1\), the global sensitivity of a function \(M\) is

\[\Delta_1 (M)= \underset{d(X,X')=1}{\text{sup}} \|M(X)-M(X') \|_1\]


For scalars, the \(l_1\)-Global Sensitivity is \(|M(X) - M(X')|\).


\(l_2\)-Global Sensitivity

Conceptual

\(l_2\)-Global Sensitivity (Dwork, McSherry, et al. 2006): The maximum amount a statistic can change with the addition or removal of the most extreme possible observation. In this case, the changes are squared, summed, and square-rooted (the Euclidean, or \(l_2\), norm).


Technical

\(l_2\)-Global Sensitivity (Dwork, McSherry, et al. 2006): For all \(X,X'\) such that \(d(X,X')=1\), the global sensitivity of a function \(M\) is

\[\Delta_2 (M)= \underset{d(X,X')=1}{\text{sup}} \|M(X)-M(X') \|_2\]


For scalars, the \(l_2\)-Global Sensitivity is \(\sqrt{(M(X) - M(X'))^2}\).


Exercise 1

Question

Suppose we are interested in counting the number of sole proprietorships in Washington, DC. What are the \(l_1\) and \(l_2\) global sensitivities of this statistic?



Hints

In other words, what is the maximum difference between \(M(X)\) and \(M(X')\) when \(d(X,X')=1\)?

The answer is one. The most a count can change by adding or subtracting one observation is one!

\(\Delta_1 (M) = \Delta_2 (M) = 1\)




Exercise 2

Question

Suppose we are interested in calculating the total income of sole proprietorships in Washington, DC. What are the \(l_1\) and \(l_2\) global sensitivities of this statistic?



Hints

In other words, what is the maximum difference between \(M(X)\) and \(M(X')\) when \(d(X,X')=1\)?

The answer is \(\infty\). A total can theoretically change by any amount with the addition or deletion of one observation.




Statistics

Counts

Counts are the most thoroughly studied statistics in differential privacy. With unbounded differential privacy, the global sensitivity of a count is always 1.

For example, assume we are counting the number of billionaires in the United States. The most the count can change with the addition or removal of Jeff Bezos is one.
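
As a minimal sketch with made-up values:

# removing (or adding) any one record changes a count by exactly 1
net_worths <- c(177e9, 151e9, 124e9) # hypothetical net worths of three billionaires

length(net_worths) - length(net_worths[-1]) # drop the largest record
## [1] 1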


Sums

Calculating the global sensitivity is more difficult for statistics other than counts. The global sensitivity of a sum is unbounded because the addition or removal of one row can theoretically change the sum by any amount.

One approach is to clip or truncate values (a sketch follows the list below). If we assume that all observations are between 6 and 10, inclusive, then adding or removing one record changes the sum by at most 10, so we can treat the global sensitivity as 10. (Under bounded differential privacy, where one record is changed rather than added or removed, the global sensitivity would be \(10 - 6 = 4\).)

  • Differential privacy does not hold if we look at the data to determine the bounds.
  • Bounds that truncate actual values bias statistics.
  • This assumption is often problematic with economic data where distributions can be highly skewed.
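
Here is a minimal sketch of clipping with made-up values, which also shows the bias that truncation introduces:

# clip each value to the assumed bounds [6, 10]
values <- c(2, 7, 9, 25) # hypothetical data with two out-of-bounds values

clipped <- pmin(pmax(values, 6), 10)

clipped
## [1]  6  7  9 10

sum(values) # true total
## [1] 43
sum(clipped) # the clipped total is biased
## [1] 32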

Means

Means can be rewritten as two queries: a total divided by a count.

  1. GS(sum) / GS(count)

Sometimes the number of observations is assumed to be known. In this case, the global sensitivity is smaller.

  2. GS(sum) / n if we assume n is known
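
As a rough sketch of the first approach, drawing Laplace noise directly with smoothmest::rdoublex() (the laplace_sanitizer() function below wraps the same call) and assuming values clipped to [6, 10]:

# a DP mean as a noisy sum divided by a noisy count,
# splitting the privacy loss budget evenly between the two queries
x <- c(6, 7, 8, 9, 10) # hypothetical clipped data
epsilon <- 1

set.seed(42)
noisy_sum <- sum(x) + smoothmest::rdoublex(n = 1, mu = 0, lambda = 10 / (epsilon / 2))
noisy_n <- length(x) + smoothmest::rdoublex(n = 1, mu = 0, lambda = 1 / (epsilon / 2))

noisy_sum / noisy_n # differentially private mean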




DP Sanitizers

A sanitizer protects against disclosure risk. A differentially private sanitizer protects against disclosure risk and meets the definition of differential privacy. If we know the global sensitivity of a statistic, then we can often add noise in a way that satisfies differential privacy.

Laplace sanitizer

Sanitizer

Conceptual

The Laplace sanitizer satisfies \(\epsilon\)-DP by adding noise from a Laplace distribution to statistics. More sensitivity means more expected noise. More \(\epsilon\) means less expected noise.

Technical

Laplace Mechanism (Dwork, McSherry, et al. 2006): The Laplace Mechanism satisfies \(\epsilon\)-DP by adding noise terms \((\eta_1,\ldots,\eta_k)\) to \(M\) that are independently drawn from a Laplace distribution with location parameter 0 and scale parameter \(\frac{\Delta_1(M)}{\epsilon}\) such that

\[\mathcal{M}(X)=M(X)+(\eta_1,\ldots,\eta_k)\]

The Laplace sanitizer uses \(l_1\)-Global Sensitivity.

Example

Let’s consider a simple example with the Palmer Penguins data set. After dropping observations with missing values, the data set contains 333 observations about Adelie, Chinstrap, and Gentoo penguins in Antarctica. Suppose we want to count how many penguins are Adelie penguins.

penguins <- palmerpenguins::penguins %>%
  drop_na()

penguins
## # A tibble: 333 x 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           36.7          19.3               193        3450
##  5 Adelie  Torgersen           39.3          20.6               190        3650
##  6 Adelie  Torgersen           38.9          17.8               181        3625
##  7 Adelie  Torgersen           39.2          19.6               195        4675
##  8 Adelie  Torgersen           41.1          17.6               182        3200
##  9 Adelie  Torgersen           38.6          21.2               191        3800
## 10 Adelie  Torgersen           34.6          21.1               198        4400
## # i 323 more rows
## # i 2 more variables: sex <fct>, year <int>


The global sensitivity is \(\Delta_1(M) = 1\), so the scale parameter of the Laplace distribution is \(\frac{\Delta_1(M)}{\epsilon} = \frac{1}{\epsilon}\). This means our differentially private statistic is one draw from a Laplace distribution centered at the confidential statistic with scale parameter equal to \(\frac{1}{\epsilon}\).

We’ll use the laplace_sanitizer() function from week 4:


# function to draw Laplace noise for one or more statistics
laplace_sanitizer <- function(sensitivity, epsilon, n = 1) {
  
  # lambda (distribution width) is sensitivity/privacy loss budget
  l <- sensitivity / epsilon 
  
  # draw from Laplace distribution
  noise <- smoothmest::rdoublex(n = n, # one draw per statistic being sanitized
                                mu = 0, # centered at 0
                                lambda = l) # scale based on l calculated above
  
  return(noise)
  
}


Let’s calculate the statistic without any noise.

answer_conf <- sum(penguins$species == "Adelie")

answer_conf
## [1] 146


Now, let’s calculate the statistic with noise that satisfies the definition of \(\epsilon\)-differential privacy.

set.seed(1)
answer_dp <- answer_conf + laplace_sanitizer(sensitivity = 1, epsilon = 0.1)

answer_dp
## [1] 153.5518


Maybe we got a lucky or unlucky draw from the Laplace distribution. Let’s calculate this statistic 100 times to understand the distribution of noisy statistics. This is purely for demonstration to understand the expectation of the noisy statistic.
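
A minimal sketch of that simulation using purrr::map_dbl():

# draw 100 noisy versions of the count to see the spread of the sanitizer
set.seed(20220509)
noisy_counts <- map_dbl(
  1:100,
  ~ answer_conf + laplace_sanitizer(sensitivity = 1, epsilon = 0.1)
)

# the draws center on the confidential count (146) with scale 1 / epsilon = 10
summary(noisy_counts)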


Gaussian Sanitizer

Sanitizer

Conceptual

The Gaussian sanitizer satisfies \((\epsilon, \delta)\)-DP by adding noise from a Gaussian distribution (also known as the normal distribution or bell curve) to statistics. More sensitivity means more expected noise. More \(\epsilon\) means less expected noise.

Technical

Gaussian Mechanism (Dwork and Roth 2014): The Gaussian Mechanism satisfies \((\epsilon,\delta)\)-DP by adding Gaussian noise with zero mean and variance, \(\sigma^2\), such that

\[\mathcal{M}(X)=M(X)+(\eta_1,\ldots,\eta_k)\]

where \(\eta_1,\ldots,\eta_k\) are independently drawn and \(\sigma=\frac{\Delta_2(M)\sqrt{2 \log(1.25/\delta)}}{\epsilon}\).

This sanitizer includes two parameters: \(\epsilon\) and \(\delta\). We can think of \(\delta\) as a small probability that the bound created by \(\epsilon\) does not hold.

The Gaussian sanitizer uses \(l_2\)-Global Sensitivity.

Example

# function to draw Gaussian noise for one statistic 
gaussian_sanitizer <- function(sensitivity, epsilon, delta) {
  
  # sigma (distribution width) is based on the sensitivity, epsilon, and delta
  sigma <- (sensitivity * sqrt(2 * log(1.25 / delta))) / epsilon 
  
  # draw from Gaussian (normal) distribution
  noise <- rnorm(n = 1, # draw one observation (adding noise to one statistic)
                 mean = 0,
                 sd = sigma) # scale based on sigma calculated above
  
  return(noise)
  
}


Let’s calculate the statistic without any noise.

answer_conf <- sum(penguins$species == "Adelie")

answer_conf
## [1] 146


Now, let’s calculate the statistic with noise that satisfies the definition of \((\epsilon, \delta)\)-differential privacy.

set.seed(1)
answer_dp <- answer_conf + gaussian_sanitizer(sensitivity = 1, epsilon = 0.1, delta = 10^-7)

answer_dp
## [1] 110.1865


Maybe we got a lucky or unlucky draw from the Normal distribution. Let’s calculate this statistic 100 times to understand the distribution of noisy statistics. This is purely for demonstration to understand the expectation of the noisy statistic.
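
A minimal sketch comparing the spread of the two sanitizers at the same privacy loss budget:

# compare 100 draws of Laplace noise and Gaussian noise
set.seed(20220509)
laplace_noise <- map_dbl(1:100, ~ laplace_sanitizer(sensitivity = 1, epsilon = 0.1))
gaussian_noise <- map_dbl(1:100, ~ gaussian_sanitizer(sensitivity = 1, epsilon = 0.1, delta = 10^-7))

sd(laplace_noise) # theoretical sd is sqrt(2) / epsilon, about 14.1
sd(gaussian_noise) # theoretical sd is sqrt(2 * log(1.25 / delta)) / epsilon, about 57.2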

The Gaussian sanitizer is worse than the Laplace sanitizer! So why do we even need a Gaussian sanitizer?

The Gaussian sanitizer composes better across multiple queries. This is because the sum of two normally distributed random variables is normally distributed, but the sum of two Laplace-distributed random variables is not Laplace-distributed.


Exponential Sanitizer

Sanitizer

Exponential Mechanism (F. McSherry and Talwar 2007): The Exponential Mechanism releases values with a probability proportional to

\[\exp\left(\frac{\epsilon u(X, \theta)}{2\Delta_1(u)}\right)\]

and satisfies \(\epsilon\)-DP, where \(u(X,\theta)\) is the score or quality function that assigns a value to each possible output, \(\theta\), on \(X\).

The Exponential sanitizer uses \(l_1\)-Global Sensitivity.
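
To make this concrete, here is a minimal sketch of the exponential mechanism applied to selecting the modal penguin species. The score function is the species count, so \(\Delta_1(u) = 1\); exponential_sanitizer() is a name introduced here for illustration:

# release the modal species with probability proportional to
# exp(epsilon * u / (2 * sensitivity)), where u is the count of each species
exponential_sanitizer <- function(scores, epsilon, sensitivity = 1) {
  
  # unnormalized selection probabilities
  probs <- exp(epsilon * as.numeric(scores) / (2 * sensitivity))
  
  # sample one output; sample() normalizes the probabilities
  sample(names(scores), size = 1, prob = probs)
  
}

species_counts <- table(penguins$species)

set.seed(1)
exponential_sanitizer(scores = species_counts, epsilon = 0.1)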




Important Theorems

Sequential Composition Theorem

The sequential composition theorem (F. D. McSherry 2009; Dwork and Rothblum 2016; Bun and Steinke 2016) allows us to calculate multiple noisy statistics under the same privacy budget.

Suppose a mechanism, \(\mathcal{M}_j\), provides \(\epsilon_j\)-DP. The sequence of \(\mathcal{M}_j(X)\) applied on the same \(X\) provides \(\sum_{j=1}^J\epsilon_j\)-DP.


Example

Let’s return to the penguins example from above. Suppose \(\epsilon = 1\) and we want to count the number of “Adelie” penguins and the number of “Chinstrap” penguins.

epsilon <- 1

set.seed(20220505)

sum(penguins$species == "Adelie") + laplace_sanitizer(sensitivity = 1, epsilon = epsilon / 2)
## [1] 146.7052
sum(penguins$species == "Chinstrap") + laplace_sanitizer(sensitivity = 1, epsilon = epsilon / 2)
## [1] 78.36747


For reference, let’s look at the truth.

sum(penguins$species == "Adelie")
## [1] 146
sum(penguins$species == "Chinstrap")
## [1] 68

Parallel Composition Theorem

The parallel composition theorem (F. D. McSherry 2009; Dwork and Rothblum 2016; Bun and Steinke 2016) allows us to conserve \(\epsilon\) (or increase accuracy) when statistics come from partitions of the data (for example, different states).

Let \(D_j\) be disjoint subsets of the input domain \(D\). The sequence of \(\mathcal{M}_j(X\cap D_j)\) provides \(\max_{j \in \{1,\ldots,J\}} \epsilon_j\)-DP.


Example

Let’s consider a larger data set with 53,940 observations about diamonds. Suppose we want to calculate a differentially private histogram of diamond sizes with bins [0, 1], (1, 2], (2, 3], (3, 4], (4, 5], and (5, 6] with \(\epsilon = 0.1\).

diamonds_conf <- count(diamonds, carat = ceiling(carat))

diamonds_conf
## # A tibble: 6 x 2
##   carat     n
##   <dbl> <int>
## 1     1 36438
## 2     2 15613
## 3     3  1857
## 4     4    27
## 5     5     4
## 6     6     1


One approach, based on sequential composition, is to use \(\frac{\epsilon}{6} = \frac{0.1}{6}\) for each of the six counting queries.

epsilon <- 0.1

set.seed(10)

diamonds_conf <- bind_cols(
  diamonds_conf,
  sequential = diamonds_conf$n + laplace_sanitizer(sensitivity = 1, epsilon = epsilon / 6, n = 6)
)

diamonds_conf
## # A tibble: 6 x 3
##   carat     n sequential
##   <dbl> <int>      <dbl>
## 1     1 36438    36437. 
## 2     2 15613    15558. 
## 3     3  1857     1902. 
## 4     4    27      -67.5
## 5     5     4       17.9
## 6     6     1       66.2


The bins for carat partition the data set and each bin is a disjoint subset of the data. Therefore, we can use parallel composition and get more accurate differentially private counts!

set.seed(11)

diamonds_conf <- bind_cols(
  diamonds_conf,
  parallel = diamonds_conf$n + laplace_sanitizer(sensitivity = 1, epsilon = epsilon, n = 6)
)

diamonds_conf
## # A tibble: 6 x 4
##   carat     n sequential parallel
##   <dbl> <int>      <dbl>    <dbl>
## 1     1 36438    36437.  36446.  
## 2     2 15613    15558.  15683.  
## 3     3  1857     1902.   1857.  
## 4     4    27      -67.5    69.0 
## 5     5     4       17.9    28.6 
## 6     6     1       66.2     9.53

Post-Processing Theorem

The Post-Processing Theorem (Bun and Steinke 2016; Dwork, McSherry, et al. 2006; Nissim, Raskhodnikova, and Smith 2007) allows us to use differentially private information in any way we see fit without releasing further information.

If \(\mathcal{M}\) is a sanitizer that satisfies \(\epsilon\)-DP and \(g\) is any function, then \(g\left(\mathcal{M}(X)\right)\) also satisfies \(\epsilon\)-DP.

Rounding and eliminating impossible values like negative counts are common types of post-processing. There are also types of post-processing that can improve accuracy by leveraging information calculated from the same data set.
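
For example, a minimal sketch of post-processing applied to the noisy diamond counts from the parallel composition example above:

# round the noisy counts and clamp impossible negative values at zero;
# post-processing consumes no additional privacy loss budget
diamonds_conf %>%
  mutate(
    sequential = pmax(0, round(sequential)),
    parallel = pmax(0, round(parallel))
  )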


Exercise 3

Consider a simulated data set with information about small businesses (0-20 employees) in Texas and Vermont.

set.seed(20220509)
small_businesses <- bind_rows(
  Texas = tibble(
    employees = rpois(n = 100010, lambda = 10),
    income = rlnorm(n = 100010, meanlog = 10, sdlog = 2)
  ),
  Vermont = tibble(
    employees = rpois(n = 403, lambda = 10),
    income = rlnorm(n = 403, meanlog = 10, sdlog = 2)
  ),
  .id = "state"
) %>%
  mutate(employees = if_else(employees > 20, 20L, employees))


Question

Using the Laplace sanitizer, calculate the number of small businesses in Texas and Vermont (count) with the overall \(\epsilon = 0.1\). Use the parallel composition theorem.

ex3_conf <- count(small_businesses, state)

ex3_conf

set.seed(46)
bind_cols(
  ex3_conf,
  n_dp = ex3_conf$n + laplace_sanitizer(
    sensitivity = ### ______, 
    epsilon = ### ______, 
    n = 2
  )
)
  • Which state has more absolute error introduced into its count?
  • Which state has more relative error introduced into its count?



Hints

The observations from Texas and Vermont are disjoint, so we can use the full \(\epsilon = 0.1\) for each statistic instead of splitting it across the statistics.

ex3_conf <- count(small_businesses, state)

ex3_conf
## # A tibble: 2 x 2
##   state        n
##   <chr>    <int>
## 1 Texas   100010
## 2 Vermont    403
set.seed(46)
bind_cols(
  ex3_conf,
  n_dp = ex3_conf$n + laplace_sanitizer(
    sensitivity = 1, 
    epsilon = 0.1, 
    n = 2
  )
)
## # A tibble: 2 x 3
##   state        n    n_dp
##   <chr>    <int>   <dbl>
## 1 Texas   100010 100029.
## 2 Vermont    403    418.

The absolute error is larger for Texas, but the relative error is much bigger for Vermont.



Exercise 4

Question

Using the Laplace sanitizer, calculate the number of employees in the entire data set (sum) with the overall \(\epsilon = 0.1\). We know from auxiliary information that the number of employees varies from 0 to 20 because they are small businesses.

ex4_conf <- small_businesses %>%
  summarize(employees = sum(employees))

set.seed(47)
bind_cols(
  ex4_conf,
  employees_dp = ex4_conf$employees + laplace_sanitizer(
    sensitivity = ### ______, 
    epsilon = ### ______, 
    n = 1
  )
)



Hints

ex4_conf <- small_businesses %>%
  summarize(employees = sum(employees))

set.seed(47)
bind_cols(
  ex4_conf,
  employees_dp = ex4_conf$employees + laplace_sanitizer(
    sensitivity = 20, 
    epsilon = 0.1, 
    n = 1
  )
)
## # A tibble: 1 x 2
##   employees employees_dp
##       <int>        <dbl>
## 1   1003755     1003807.




Other Formal Privacy Definitions

Approximate Differential Privacy

Approximate Differential Privacy, also known as \((\epsilon, \delta)\)-Differential Privacy, is a relaxation of \(\epsilon\)-Differential Privacy. We saw this definition above with the Gaussian sanitizer.

\((\epsilon, \delta)\)-Differential Privacy (Dwork, Kenthapadi, et al. 2006): A sanitization algorithm, \(\mathcal{M}\), satisfies \((\epsilon, \delta)\)-DP if for all subsets \(S\subseteq Range(\mathcal{M})\) and for all \(X,X'\) such that \(d(X,X')=1\),

\[\Pr(\mathcal{M}( X) \in S)\le \exp(\epsilon) \Pr(\mathcal{M}( X')\in S) + \delta\]

where \(\delta\in [0,1]\).


We can think of \(\delta\) as a small probability that the bound created by \(\epsilon\) does not hold. \(\epsilon\)-DP is a special case of \((\epsilon, \delta)\)-DP when \(\delta=0\).


Zero-Concentrated Differential Privacy

Zero-Concentrated Differential Privacy is another relaxation of \(\epsilon\)-Differential Privacy. This definition is used by the Census Bureau for the 2020 Decennial Census.

Zero-Concentrated Differential Privacy (Bun and Steinke 2016): A sanitization algorithm, \(\mathcal{M}\), satisfies \((\xi, \rho)\)-zero-concentrated differential privacy if for all \(X, X'\) that are \(d(X,X')=1\) and \(\alpha\in (1, \infty)\),

\[D_\alpha(\mathcal{M}(X)||\mathcal{M}(X'))\leq\xi+\rho\alpha\]

where \(D_\alpha(\mathcal{M}(X)||\mathcal{M}(X'))\) is the \(\alpha\)-Rényi divergence between the distribution of \(\mathcal{M}(X)\) and the distribution of \(\mathcal{M}(X')\), and \(\xi\) and \(\rho\) are positive constants.

Zero-Concentrated Differential Privacy is closely related to Rényi Differential Privacy, since both are defined in terms of the Rényi divergence.



Unpacking \(\epsilon\)

Differential privacy bounds the ratio of the probability that a given output came from the data including any individual observation to the probability that it came from the data without that observation; the log of this ratio is bounded by \(\epsilon\).

\[\frac{\Pr(\mathcal{M}( X) \in S)}{ \Pr(\mathcal{M}( X')\in S)}\le \exp(\epsilon)\]

The bound is in exponential units, so modest increases in \(\epsilon\) correspond with large increases in the ratio of the probabilities.

Early differential privacy researchers thought \(\epsilon = 1\) or \(\epsilon = 2\) were upper bounds on \(\epsilon\). Today, much higher values of \(\epsilon\) are used. The April 2021 Decennial Census demonstration data used \(\epsilon = 4.0\) and \(\epsilon = 10.3\) for the person-level file. The Decennial Census ended up using \(\epsilon = 17.14\) for the person-level file.

Let’s consider the ratios of the probabilities for different values of \(\epsilon\). A minimal sketch that computes them (rounded to whole numbers):
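
tibble(
  epsilon = c(0.25, 0.5, 0.75, 1, 2, 4, 6, 8, 10.3, 17.14),
  ratio = round(exp(epsilon))
)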

## # A tibble: 10 x 2
##    epsilon    ratio
##      <dbl>    <dbl>
##  1    0.25        1
##  2    0.5         2
##  3    0.75        2
##  4    1           3
##  5    2           7
##  6    4          55
##  7    6         403
##  8    8        2981
##  9   10.3     29733
## 10   17.1  27784809

It is tough to reason about what a ratio of 27,784,809 even means.



Key Takeaways

  • Differential privacy places a bound on the amount of information released under extreme assumptions about the knowledge of an attacker and their computing power.
  • Global sensitivity measures how much a statistic can change with the addition or removal of the most extreme possible value.
  • Sanitizers, like the Laplace sanitizer, satisfy differential privacy by adding a specific amount of random noise to statistics.
  • Higher values of \(\epsilon\) mean more information is potentially released.
  • Statistics with higher global sensitivity require more noise to satisfy a given definition of differential privacy than statistics with lower global sensitivity.


References

Bun, Mark, and Thomas Steinke. 2016. “Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds.” In Theory of Cryptography Conference, 635–58. Springer.
Dwork, Cynthia, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. 2006. “Our Data, Ourselves: Privacy via Distributed Noise Generation.” In Annual International Conference on the Theory and Applications of Cryptographic Techniques, 486–503. Springer.
Dwork, Cynthia, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. “Calibrating Noise to Sensitivity in Private Data Analysis.” In Theory of Cryptography Conference, 265–84. Springer.
Dwork, Cynthia, and Aaron Roth. 2014. “The Algorithmic Foundations of Differential Privacy.” Foundations and Trends in Theoretical Computer Science 9 (3–4): 211–407.
Dwork, Cynthia, and Guy N Rothblum. 2016. “Concentrated Differential Privacy.” arXiv.
McSherry, Frank D. 2009. “Privacy Integrated Queries: An Extensible Platform for Privacy-Preserving Data Analysis.” In Proceedings of the 2009 Association for Computing Machinery’s Special Interest Group on Management of Data International Conference on Management of Data, 19–30. Association for Computing Machinery.
McSherry, Frank, and Kunal Talwar. 2007. “Mechanism Design via Differential Privacy.” In 48th Annual Institute of Electrical and Electronics Engineers Symposium on Foundations of Computer Science, 94–103. Institute of Electrical and Electronics Engineers.
Nissim, Kobbi, Sofya Raskhodnikova, and Adam Smith. 2007. “Smooth Sensitivity and Sampling in Private Data Analysis.” In Proceedings of the 39th Annual Association for Computing Machinery Symposium on Theory of Computing, 75–84. Association for Computing Machinery.