1 Basics of Data Governance and Privacy

Published

April 23, 2026

Overview

In this introductory chapter, you will learn about:

Dimensions of data governance and their privacy implications.
Kinds of trade-offs associated with disclosure risk.
Definitions and interpretations of data privacy.

1.1 Why is Data Governance and Privacy Important?

Data governance ultimately concerns decision-making around how data flows or does not flow between parties. While different actors may be more concerned with expanding or restricting data flows, real-world harms can occur if data sharing is too permissive or too restrictive.

Example: SNAP data governance

Recent changes to SNAP administration reveal the limits of both too much data sharing and too little data sharing:

In October 2025, NPR reported that states turned over personal information on SNAP recipients to USDA. This new centralization of SNAP data at USDA poses many data governance questions, including but not limited to…
- Legal risk: is this data collection allowed under federal and state law?
- Institutional risk: is this data collection in line with expectations for how federal and state agencies collaborate?
- Data subject protection: is there transparency around how this data will be used, and will there be sufficient protections for data subjects?
In September 2025, NPR reported that the USDA is canceling the Household Food Security Report. The loss of this food security data poses many data governance questions, including but not limited to…
- Data alternatives: what alternative data sources might users turn to in place of this report, and are they sufficient replacements?
- Evidence-based decision-making: what kinds of policy decisions at the federal, state, or local level are now lacking evidence due to this report being unavailable?
- Public accountability: how is the USDA justifying its decision to cancel the report, and how are the savings from canceling the report being used?

The following are additional real-world examples of the tensions between data access, usability, privacy, security, and ethics.


Screenshot of a Wall Street Journal article titled “No Place to Hide: Colleges Track Students, Everywhere”, with the subtitle “Schools use tech to follow students online, on the quad, and in the football stadium.” The article was written by Douglas Belkin and published on March 5, 2020.

The same data that can enable responsible, evidence-based decision-making can also raise legitimate privacy concerns. For example, some universities use smartphone apps to monitor student attendance—tracking when students arrive late, leave early, or miss class entirely, especially in large lecture halls with over a hundred students. These apps also record which campus facilities students use, such as libraries or gyms.

Making this data more widely available can help universities better understand student behavior, guide resource investments, or even support emergency alerts (e.g., during an active shooter event).
At the same time, such tracking could inadvertently reveal students’ identities, sensitive FERPA-protected information like grades, or fine-grained real-time student location, all of which raise privacy and safety concerns.

Different subpopulations experience different relationships to data privacy. For example, the right to privacy for disabled individuals is often compromised the moment they seek services or support.

Blind individuals may rely on medical devices with widely varying privacy standards. For example, virtual assistants or devices like Meta AI glasses help individuals navigate the world independently while raising concerns that companies like Meta may share or sell user data.
Disabled individuals may also face structural privacy barriers when accessing medical care. For example, health data for disabled individuals is routinely reported to the Centers for Medicare & Medicaid Services (CMS), often without explicit consent.
Simultaneously, disabled individuals often need their disability status disclosed to properly receive legible, timely, and actionable information. For example, during the Kerr County Flood, emergency evacuation warnings were not effectively issued to allow those with mobility challenges to successfully evacuate.

Those who collect and disseminate data may have different data privacy expectations and obligations than data subjects. For example, during a roundtable discussion hosted by the Council of the Section of Legal Education and Admissions to the Bar of the American Bar Association (ABA), the ABA focused on reviewing, refining, and expanding demographic data collection practices for students, faculty, and staff.

Some universities may choose to withhold the number of transgender students enrolled in their law schools to protect individual privacy. Since transgender students are frequently a small minority of law student cohorts, choosing not to disclose this information can help protect the identities and “out” statuses of potentially affected transgender students.
While some students preferred this university approach, others preferred accurate representation in the data, believing it helps foster connection among transgender students and signals that their law school is inclusive and welcoming. Different universities reached different outcomes depending on how group representativeness was valued and who was involved in shaping these decisions.

1.2 Defining key stakeholders in the data ecosystem

For these training materials, this is how we define the various stakeholders in the data ecosystem.

The goal of this training is to equip data curators, practitioners, and other professionals with the technical, legal, and social evidence needed to justify their decisions transparently and responsibly to those affected, such as data subjects.

1.3 What are the Dimensions of Data Governance?

Data governance concerns ensuring appropriate flow of information in data processing systems. However, there are countless approaches to both 1) determining what is appropriate flow, and 2) deciding how to ensure appropriate flow. To break it down, we consider…

Values: how might we want the flow of information to function?
Perspectives: what disciplinary tools, methods, and frameworks give us guidance on how to govern data?
Trade-offs: what aspects of data governance are in tension with one another?

1.3.1 Data Governance Values

Accuracy

Accuracy often refers to the quality of available data, ensuring that data meaningfully represents the data subjects.

Why it matters: Inaccurate data can lead to flawed analyses, poor decision-making, and misrepresentation of data subjects.

Key ideas:

Quantitative and qualitative data quality assessments.
End-to-end monitoring of data production processes.

Accessibility

Accessibility focuses on what kinds of data are accessible, by whom, and under what conditions.

Why it matters: Data should be available to those who need it for legitimate purposes.

Key ideas:

Different data access models for different user groups.
Technologies and policies that enforce different access models.

Usability

Usability ensures that data is understandable, actionable, and fit for purpose.

Why it matters: Highly accurate data is ineffective if users cannot interpret, use, or otherwise act upon it.

Key ideas:

Clear and accessible documentation, metadata, and guidance on appropriate or inappropriate use.
Training and communication strategies for diverse user audiences.

Privacy

Privacy safeguards data subjects’ sensitive information from illegitimate access and use.

Why it matters: Data curators are ultimately responsible for protecting the privacy rights of data subjects and maintaining their trust.

Key ideas:

Privacy and security risk assessments.
Technologies and policies to safely share data products.

1.3.2 Perspectives

Technical perspectives

Technical perspectives on data governance concern technological interventions for controlling data access and use.

Example interventions include, but are not limited to…

Privacy-enhancing technologies (PETs): computational technologies that measure and/or restrict privacy risk in data processing.
Algorithmic fairness technologies: computational technologies that measure and/or restrict group disparities in data processing.

Legal perspectives

Legal perspectives on data governance concern laws and policies that ensure legally compliant data processing.

Example interventions include, but are not limited to…

Information privacy laws and their implementation.
Writing and executing data use agreements and policies.

Ethical perspectives

Ethical perspectives on data governance concern ethical practices for data governance decision-making and justification.

Example interventions include, but are not limited to…

Data governance codes of conduct.
Data-subject-powered accountability mechanisms.

1.3.3 Trade-offs

Trade-off #1: Privacy-Utility

Increasing the quantity and quality of available data necessarily increases data subject privacy risks.

Privacy-utility Trade-off image — Figure 1.3

Definition: Data Utility, Quality, Accuracy, or Usefulness

Data utility, quality, accuracy, and usefulness refer to how practically useful and/or accurate the data are for research and analysis purposes.

Making more higher-quality data easily available necessarily increases data subject privacy risks.
- Ex1: providing record-level data instead of aggregated data makes it easier to single out individuals in datasets.
- Ex2: providing more detailed demographic data makes it easier to associate records with specific individuals.
Making data more private necessarily means providing less data or lower-quality data.
- Ex1: Data that have been altered for privacy protection purposes are necessarily harder to analyze than their unprotected counterparts.
- Ex2: Many public datasets have specific data fields entirely removed for privacy purposes, making them unusable for analyses that depend on those fields.
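
Both directions of this trade-off can be illustrated with a small sketch. The records, field choices, and coarsening rule below are entirely hypothetical; the code only counts how many records are unique on their combination of attributes, first with detailed fields and then with coarsened ones.

```python
from collections import Counter

# Hypothetical toy records: (age, zip_code, sex). Not real data.
records = [
    (34, "20001", "F"),
    (36, "20002", "F"),
    (35, "20001", "M"),
    (52, "20002", "M"),
    (58, "20001", "M"),
    (51, "20002", "F"),
]

def count_unique(rows):
    """Return how many rows have a combination of values shared by no other row."""
    counts = Counter(rows)
    return sum(1 for row in rows if counts[row] == 1)

# Detailed release: exact age and full ZIP code.
detailed = records

# Coarsened release: 20-year age band and no ZIP code (an assumed rule).
coarse = [(age // 20 * 20, sex) for age, _zip, sex in records]

unique_detailed = count_unique(detailed)  # every record is unique
unique_coarse = count_unique(coarse)      # fewer records stand out
```

In this toy case all six detailed records are unique, while only two remain unique after coarsening. The coarsened file is harder to link to specific individuals, but it can no longer support analyses by exact age or ZIP code.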

In other words…

Greater Data Utility

Data quantity + quality
Ease of access
Permitted use and dissemination

Greater Data Privacy

Technical protections
Secure architectures
Rules and regulations

Trade-off #2: Security-Accessibility

Increasing the ease with which users access data necessarily increases data subject security risks.

Making it easier to access data necessarily makes it easier for unintended parties to access data.
- Ex1: hosting data publicly on websites allows automated processes like web crawlers, AI training data processes, etc. to use this data in insecure manners.
- Ex2: insecure file sharing practices (like sharing unencrypted email attachments with collaborators) increase both convenience and security risks in case of an email breach.
Making it harder for adversarial actors to access data necessarily increases administrative burdens for intended parties.
- Ex1: multi-factor authentication for data access increases the time and effort needed to log in and access a dataset.
- Ex2: analyzing data on secure computing systems can be slower and more burdensome than analyzing data on insecure systems.

Trade-off #3: Transparency-Trust

Increasing transparency about data processing necessarily reduces the trust in expertise needed to justify data governance decisions, for better and worse.

Failing to share sufficiently transparent information about data processing demands that users put more trust in data curators.
- Ex1: Restricting publicly available information about data collection methods requires users to trust that data was collected in a justifiable manner.
- Ex2: Restricting publicly available information about data processing risks requires users to trust that any risks were properly assessed by the data curators.
Sharing too much transparent information about data processing can unintentionally undermine trust in expertise needed to make nuanced or ambiguous data governance decisions.
- Ex1: sharing too much transparent information about appropriate or inappropriate use of data may inadvertently discourage otherwise appropriate data use.
- Ex2: sharing too much information about privacy risks associated with a dataset may unintentionally discourage data subjects from participating in data processing.

Note: Context matters

Data privacy best practices resist simple technical or legal standardization, as effective solutions often depend on nuanced and evolving contexts. All data privacy approaches perform best when evaluated holistically, with attention to the specific social, legal, and technical environments in which they operate. These practices also require navigating disagreements and competing priorities among stakeholders.

1.4 What is Data Governance in the Data Lifecycle?

A colorful circular diagram of the data lifecycle, showing six stages: data collection and acquisition, data storage, data sharing and transfer, data analysis, data dissemination, and data destruction and archival. — Figure 1.6

A data lifecycle diagram again, but with guiding questions for each part. — Figure 1.7

1.4.1 Data governance guiding questions and considerations throughout the data lifecycle

The following are some questions to consider for each stage of the data lifecycle in project planning or proposal development.

What values form the basis of the data lifecycle associated with this dataset?
How are these values implemented across each phase of the lifecycle?
Who or what holds ultimate responsibility for privacy, security, and ethical considerations throughout the data lifecycle, and how is accountability maintained?

1.4.2 Data collection

Why are the data being collected, and who are the intended users?

What data should be collected, or not collected, and how?
How much do data subjects know about what data are being collected and how they will be used?
What kinds of decision-making would only be possible with access to new data, and how are both the benefits and risks of data collection communicated to subjects?
If data are being collected, who designs the data collection process?
Should demographic information be collected, and if so, how do we minimize risks to different populations when collecting such information?
How do we balance the need for information with the need to protect communities?

1.4.3 Data storage

What are the privacy and security responsibilities and protections when storing sensitive data?

Where and how are the data stored?
What additional information, such as metadata, is stored with the data?
Are the FAIR principles (Findable, Accessible, Interoperable, Reusable) applied?
Who has access to the data, how is that access restricted or not, and how do we ensure accessibility in that data access? E.g., secure enclaves, public data files and statistics.

1.4.5 Data analysis

How do we ensure proper data analysis?

How are the data being analyzed?
Do analysts have the right data and metadata to make informed analytic decisions about whether and how to process data?
Are there appropriate disclaimers, guidelines, or warnings to ensure outputs are used responsibly after analysis?

1.4.6 Data dissemination

How do the intended data products support their audiences?

How do we ensure insights are communicated in ways that are accessible, accurate, and useful while minimizing disclosure risks or harm?
What narratives can emerge from the same dataset, and how might they be framed differently?
How might audiences react to the insights generated?

1.4.7 Data archive and termination

Are the data subjects being responsibly represented or forgotten?

How should we inform participants about data destruction policies without hurting the data collection process?
When is it appropriate to archive data? Are there situations where destroying data could be unethical, e.g., by disproportionately impacting certain subpopulations?
How long should data be stored before they are securely destroyed?
If data are destroyed, what information should be preserved to ensure there is a record of its existence?

Class Activity 1

For the data you are working with, think through these questions to identify the "who" (refer to the data ecosystem for help):

Who should define what data to collect, or not collect, given that these choices shape the visibility of issues and the potential for policy change for your communities?
Who is best positioned to responsibly and ethically protect individuals and communities when collecting and storing sensitive data, especially demographic information?
Who should we entrust with data sharing practices? How can we help build and foster systems across institutions or sectors to promote transparency and trust?
Who should lead and contribute to data analysis and dissemination? What storytelling reinforces or challenges dominant narratives? How do we ensure responsible interpretation and use of insights?
Who should guide the long-term decisions about data archiving and destruction to ensure accountability, transparency, and historical integrity?

1.5 What are the Modes of Accessing Data and Statistics?

A data lifecycle diagram again, but with a focus on data sharing and transfer, data analysis, and data dissemination. — Figure 1.8

There are many versions of the data we should define.

Different levels of security and privacy are needed for different versions of the data. We will mostly focus on the confidential and public versions of the data. However, note that the original or raw data must also be securely stored and properly documented for future reference.

Definition: Original Dataset

The original dataset is the uncleaned, unprotected version of the data.

For example, raw 2020 Decennial Census microdata, which are never publicly released.

Definition: Confidential Data

A confidential dataset contains personal or sensitive information that is, in general, not publicly accessible without specific provisions (for example, applying for restricted use permission).

For example, the Census Edited File is the final confidential data for the 2020 Census. This dataset is never publicly released but may be made available to others who are sworn to protect confidentiality (i.e., Special Sworn Status) and who are provided access in a secure environment, such as a Federal Statistical Research Data Center.

Definition: Public Dataset or Statistics

A public dataset is the publicly released version of the confidential data.

For example, the US Census Bureau’s public tables and datasets, or the unemployment rate statistics reported by the Bureau of Labor Statistics.

Data users have traditionally gained access to data via:

Direct or secure access to the confidential data if they are trusted users (e.g., obtaining Special Sworn Status to use the Federal Statistical Research Data Centers).
Access to public data or statistics, such as public microdata and summary tables, that the data curators and privacy experts produced with modification to protect confidentiality.
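
One widely used modification of this kind is small-cell suppression, in which counts below a threshold are withheld from public tables. The sketch below is a minimal illustration under stated assumptions: the table, the category names, and the threshold of 10 are hypothetical, and actual disclosure-avoidance rules vary by agency and dataset.

```python
# Hypothetical confidential counts by category; not real data.
confidential_counts = {
    "Category A": 412,
    "Category B": 57,
    "Category C": 3,
    "Category D": 9,
}

SUPPRESSION_THRESHOLD = 10  # assumed threshold, for illustration only

def suppress_small_cells(counts, threshold):
    """Replace counts below the threshold with None to mark them as withheld."""
    return {
        category: (count if count >= threshold else None)
        for category, count in counts.items()
    }

public_table = suppress_small_cells(confidential_counts, SUPPRESSION_THRESHOLD)
```

Here the counts for Category C and Category D are withheld while the larger counts are published unchanged. If a total were published alongside the table, a single suppressed cell could be recovered by subtraction, so curators generally need to suppress or adjust additional cells as well.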

1.5.1 Tiered Access

Definition: Tiered Access

Tiered access is a data governance model that provides different levels of access to data users based on their needs and disclosure risks of their research projects.

Tiered access can include:

Public-use data files
Synthetic public-use data files
Restricted-use data in online data enclaves
Restricted-use data in on-premise research centers
Formally private query systems

The Urban Institute is developing public-use synthetic datasets and formally private validation servers as a model of tiered access in partnership with the Statistics of Income Division at the IRS.

Another example is the All of Us Research Hub, one of the largest biomedical data resources of its kind, which provides data, research tools, and research projects.

1.5.2 Secure data access

Over the years, U.S. government agencies have been moving slowly toward allowing more data users direct access to the underlying cleaned data, under strict controls.

An example of direct data access is through a secure enclave, such as the Federal Statistical Research Data Centers.¹ This secure enclave became available in 1982 (then called the Center for Economic Studies), after data users demanded access to better quality data when the US Census Bureau became more aggressive with its applications of statistical data privacy methods on its data products.

Although more secure facilities are becoming available (for example, the National Science Foundation Secure Data Access Facility²), researchers face several challenges to obtaining this direct access. Full access to these data is only available to select U.S. government agencies, a limited number of data users working in collaboration with analysts from those agencies, or through highly selective research programs administered by these agencies. Further, data users are often required to be US citizens, undergo lengthy clearance processes to gain direct access (which can take months or years), and submit extensive research proposals.

Image of the locations of all 35 FSRDCs in the United States as of March 12, 2025.

Note: Federal Statistical Research Data Centers

As of March 2025, there are 35 Federal Statistical Research Data Centers across the United States. See the U.S. Census Bureau’s webpage on Federal Statistical Research Data Centers for the number and locations.

The 35 Federal Statistical Research Data Centers across the United States (including Puerto Rico!) may seem like enough to be geographically accessible to most data users. But that is not the case. These data centers are primarily located in places with large academic institutions.

Sometimes confidential data will need to be transferred to an external system for further analysis. The following are two standard options for ensuring a secure file transfer:

File transfer using secure electronic connections. Most workplaces have Secure File Transfer Protocol (SFTP) servers to allow external parties to exchange data with them through encrypted connections. This also includes encrypted emails, file transfers to file hosting services (e.g., Dropbox), survey tools (e.g., Qualtrics), and other services using browser-based transfers with Transport Layer Security (TLS).
File transfers using compressed (e.g., zipped), encrypted, password-protected files attached to emails, where the password is shared by another means (e.g., by phone or text message) and the encryption is FIPS 140-2 compliant, usually AES-128 or AES-256.

WARNING: DO NOT EMAIL DATA WITHOUT ENCRYPTION!

Do not share data through unencrypted file transfers over the Internet or in the body of, or as an unencrypted attachment to, an unencrypted email. Always consider some form of SFTP!

You can also restrict access by limiting the use of confidential variables. For example, if a file is considered confidential because it contains identifying names and addresses, those variables may be removed from the file and replaced with pseudo identifiers. The sanitized file can then be used and shared with a much lower risk of violating confidentiality, although the remaining variables should still be reviewed for re-identification risk. You can also restrict access by limiting which people within your workplace can access specific computer accounts or files.
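
Replacing direct identifiers with pseudo identifiers can be sketched as follows. This is a minimal illustration, not a vetted procedure: the field names, the records, and the choice of a keyed hash (HMAC-SHA256 from the Python standard library) are all assumptions, and the secret key would need to be stored separately from any shared file.

```python
import hashlib
import hmac
import secrets

# Secret key kept by the data curator and never shared with the file.
# Generated fresh here for illustration; a real workflow would store it securely.
SECRET_KEY = secrets.token_bytes(32)

def pseudo_id(name, address, key):
    """Derive a stable pseudo identifier from identifying fields using a keyed hash."""
    message = f"{name}|{address}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()[:16]

def sanitize(record, key):
    """Return a copy of the record with name and address replaced by a pseudo identifier."""
    cleaned = {k: v for k, v in record.items() if k not in ("name", "address")}
    cleaned["pseudo_id"] = pseudo_id(record["name"], record["address"], key)
    return cleaned

# Hypothetical confidential records; not real people.
confidential = [
    {"name": "Jane Doe", "address": "1 Main St", "benefit_months": 7},
    {"name": "John Roe", "address": "9 Oak Ave", "benefit_months": 2},
    {"name": "Jane Doe", "address": "1 Main St", "benefit_months": 4},
]

sanitized = [sanitize(r, SECRET_KEY) for r in confidential]
```

Because the hash is keyed, the same person receives the same pseudo identifier across records, so analysts can still link rows without seeing names or addresses. Pseudo identifiers alone do not remove all disclosure risk, since the remaining fields may still single out an individual in combination.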

Class Activity 2

We’ve explored secure data access methods, such as secure enclaves and restricted file transfers, and it’s clear that strong protections are essential for safeguarding sensitive information.

Given the current infrastructure for secure data access (e.g., FSRDCs, remote secure setups, restricted file transfers), what are the privacy, security, ethical and social implications of these systems?

Things to consider in answering this question:

Who gets access to high-quality, confidential data—and who doesn’t?
How do geographic, institutional, or citizenship requirements influence equal opportunity in research?
Are current security measures proportionate to the risks they’re trying to mitigate?
How might these systems reinforce or challenge power imbalances in data use?

1.6 What Even is Data Privacy?

Class Activity 3

In one sentence, how would you personally define “privacy”?
After hearing others’ definitions, did you notice any similarities or differences? What do you think explains those patterns?

Data privacy is deeply multifaceted!

Technical controls for controlling data access
Legal agreements between entities governing data use
Social norms describing contextual information flows
Ethical practices for information access decision-making

1.6.1 Privacy and Confidentiality

Definition: Data Privacy

Data Privacy is determining and enforcing the appropriate flow of personal information through various data processes.

Definition: Confidentiality

Confidentiality is “the agreement, explicit or implicit, between data subject and data collector regarding the extent to which access by others to personal information is allowed” (Fienberg and Jin 2018).

When reviewing these definitions, it’s important to note that the terms data privacy and confidentiality are often used interchangeably, but they refer to distinct concepts.

Privacy centers on the various flows of personal information in the different processes. In contrast, confidentiality pertains to the responsibility of data curators to protect that information.
Both privacy and confidentiality aim to protect sensitive information and foster trust among the various entities involved in data sharing and access.

IMPORTANT: What is covered and not covered

Data privacy and confidentiality is a broad topic, which includes data security, encryption, access to data, etc. These materials do not cover privacy breaches from unauthorized access to a database (e.g., hackers).

There are differing notions of what should and shouldn’t be private, which may include being able to opt out of or opt into disclosure protections.

¹ “Federal Statistical Research Data Centers (FSRDCs) are partnerships between federal statistical agencies and leading research institutions. FSRDCs provide secure environments supporting qualified researchers using restricted-access data while protecting respondent confidentiality.” From the U.S. Census Bureau’s webpage on Federal Statistical Research Data Centers.
² The National Science Foundation Secure Data Access Facility provides authorized researchers secure remote access to National Center for Science and Engineering Statistics data and metadata, such as the Survey of Earned Doctorates and the National Survey of Recent College Graduates.
