2 The tidysynthesis philosophy
tidysynthesis is a “metapackage” for creating synthetic data sets for statistical disclosure control that shares the underlying design philosophy, grammar, and data structures of the tidyverse and tidymodels.
tidysynthesis currently consists of two packages, library(tidysynthesis) and library(syntheval), but will likely be split into more packages in the future.
2.1 For Users
tidysynthesis is designed around a few key principles that benefit users. tidysynthesis should:
- be designed for humans.
- give users the full predictive modeling toolkit available in tidymodels.
- embrace small, clear functions over big functions with many arguments.
- reuse objects so small changes in synthesis specifications require marginal changes to code.
- catch mistakes before computation.
- contain robust documentation.
2.1.1 1. Human-centered design
We want users to inform the design of tidysynthesis.
Human time is often more expensive than computing time. When faced with difficult trade-offs, tidysynthesis sacrifices computation speed and brevity for clarity.
tidysynthesis is modular so it is easier to update its design based on user and developer feedback.
2.1.2 2. tidymodels
tidysynthesis synthesizes data sets as a sequence of predictive models. The more predictive modeling tools available to users, the better.
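To make "a sequence of predictive models" concrete, here is a minimal base R sketch of sequential synthesis. It is an illustration of the general idea only and does not use or reproduce the tidysynthesis API; the toy data, the variable names (age, income), and the choice of a linear model are all assumptions made for this example.

```r
# Illustrative base R sketch of sequential synthesis; not tidysynthesis code.
set.seed(20240101)

# Toy "confidential" data (made up for this example).
conf <- data.frame(age = round(runif(200, 18, 80)))
conf$income <- 20000 + 500 * conf$age + rnorm(200, sd = 5000)

n_synth <- nrow(conf)

# Step 1: synthesize the first variable by resampling its observed values.
synth <- data.frame(age = sample(conf$age, size = n_synth, replace = TRUE))

# Step 2: fit a predictive model for the next variable given the earlier one.
fit <- lm(income ~ age, data = conf)

# Step 3: predict on the synthetic predictors, then add noise drawn from the
# estimated residual distribution so values are sampled rather than fitted.
synth$income <- predict(fit, newdata = synth) +
  rnorm(n_synth, mean = 0, sd = sigma(fit))

head(synth)
```

Each additional variable would repeat steps 2 and 3, conditioning on the variables synthesized before it. Swapping lm() for another model is where a broader modeling toolkit becomes useful.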
tidymodels is a collection of packages for modeling and machine learning using tidyverse principles. It provides tools for the full predictive modeling workflow and offers a common interface to most predictive modeling packages in R.
tidysynthesis aims to leverage all of the hard work done by the developers of tidymodels. tidysynthesis contains regularized regression models because tidymodels contains regularized regression models. tidysynthesis contains feature and target engineering because tidymodels contains feature and target engineering.
We highly recommend learning more about tidymodels from the tidymodels tutorial and Tidy Modeling with R.
2.1.3 3. Functional
R is a functional programming language. tidysynthesis uses small, clear functions to change the behavior of syntheses instead of YAML headers or configuration files. This enables principles 4 and 5: reusing objects and lazy computation.
2.1.4 4. Reuse objects
Users often want to test multiple approaches to synthesis. When they do, they can reuse most of their code across syntheses. For example, if a user only wants to vary constraints during synthesis, they only need to run constraints(), presynth(), and synthesize(). They can reuse their visit sequence, roadmap, synth_spec, noise, and replicates.
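The reuse pattern can be sketched in base R. The helpers below (make_spec(), make_constraints(), run_synthesis()) are hypothetical names invented for this illustration; they are not tidysynthesis functions and do not mirror their signatures. The point is only that a specification object is built once and shared while the constraints vary.

```r
# Hypothetical helpers for illustration; NOT the tidysynthesis API.
set.seed(1)
toy <- data.frame(x = 1:50)
toy$y <- 2 * toy$x - 20 + rnorm(50, sd = 10)

# A small specification object that is built once and reused.
make_spec <- function(formula) list(formula = formula)

# A small constraints object; only this part changes between runs.
make_constraints <- function(lower = -Inf, upper = Inf) {
  list(lower = lower, upper = upper)
}

# Fit the model, sample from it, then clamp the draws to the constraints.
run_synthesis <- function(data, spec, constraints, seed = 123) {
  set.seed(seed)
  fit <- lm(spec$formula, data = data)
  draws <- predict(fit, newdata = data) +
    rnorm(nrow(data), sd = sigma(fit))
  pmin(pmax(draws, constraints$lower), constraints$upper)
}

spec <- make_spec(y ~ x)

# Two syntheses share the same spec and differ only in their constraints.
unconstrained <- run_synthesis(toy, spec, make_constraints())
nonnegative <- run_synthesis(toy, spec, make_constraints(lower = 0))
```

Because the specification is an ordinary object, trying a third set of constraints takes one more line rather than a copy of the whole script.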
2.1.5 5. Lazy computation
Users make mistakes. tidysynthesis should minimize the chance of a user making a mistake and maximize the chance of catching a mistake when it inevitably happens. Data synthesis can be computationally very expensive, so it is paramount to catch mistakes early.
tidysynthesis is lazy, so no serious computation happens until the synthesize() function is called. tidysynthesis functions contain robust checks for inputs, so many errors are caught before synthesize().
For example, roadmap() checks that all variables in the visit_sequence() are present in the confidential data. Similarly, synth_spec() checks that appropriate feature and target engineering, algorithms, and samplers are included for all variables in the visit sequence.
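The "validate early, compute late" pattern can be sketched in base R as follows. plan_synthesis() and run_plan() are hypothetical names invented for this illustration, not tidysynthesis functions; the check shown is a simplified analogue of verifying that every variable in a visit sequence exists in the confidential data.

```r
# Hypothetical helpers for illustration; NOT the tidysynthesis API.

# Cheap step: validate inputs and return a plan without doing any real work.
plan_synthesis <- function(data, visit_order) {
  missing_vars <- setdiff(visit_order, names(data))
  if (length(missing_vars) > 0) {
    stop(
      "Variables not found in the confidential data: ",
      paste(missing_vars, collapse = ", "),
      call. = FALSE
    )
  }
  structure(
    list(data = data, visit_order = visit_order),
    class = "synthesis_plan"
  )
}

# Stand-in for the expensive step: resample each variable in visit order.
run_plan <- function(plan) {
  as.data.frame(lapply(
    plan$data[plan$visit_order],
    function(col) col[sample.int(length(col), replace = TRUE)]
  ))
}

conf <- data.frame(age = c(34, 51, 27), income = c(42000, 58000, 31000))

# A typo in a variable name fails immediately, before any expensive work.
msg <- tryCatch(
  plan_synthesis(conf, c("age", "incom")),
  error = function(e) conditionMessage(e)
)

# A valid plan is created cheaply; computation happens only in run_plan().
plan <- plan_synthesis(conf, c("age", "income"))
synth <- run_plan(plan)
```

Separating the plan from its execution means a misspelled variable costs seconds to discover instead of surfacing hours into a long synthesis.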
2.1.6 6. Contain robust documentation
This book is an effort to clearly document tidysynthesis. We’ve included several examples and hope to add more.
2.2 For Developers
We want tidysynthesis to be a community-developed tool. We’ve embraced a few principles to make it easier for developers to contribute to the package.
- tidysynthesis’s modular design makes it more extensible. This is inspired by ggplot2 extensions and extensions to tidymodels.
- tidysynthesis contains hundreds of tests to ensure that changes don’t break the package.
- Robust documentation about “the why” should hopefully orient development.
2.3 Inspiration
2.3.1 tidy tools manifesto
The tidy tools manifesto changed R. tidysynthesis is heavily inspired by a few of its principles:
- Reuse existing data structures.
- Compose simple functions with the pipe.
- Embrace functional programming.
- Design for humans.
- Design small packages that work well together.
- Build extensible tools.
2.3.2 tidymodels conventions
tidymodels is a herculean effort with clear principles that inspired our endeavor. In particular:
- All results should be reproducible from run-to-run.
- Retain minimally sufficient objects in the model object.
- Every class should have a print method that gives a concise description of the object.
2.3.3 synthpop
synthpop is “a tool for producing synthetic versions of microdata containing confidential information so that they are safe to be released to users for exploratory analysis.”
tidysynthesis would not exist without the example set by synthpop and its authors Beata Nowok, Gillian M Raab, Chris Dibben, Joshua Snoke, and Caspar van Lissa. synthpop was added to CRAN in the summer of 2014 and is a groundbreaking tool.
We decided to build a new package for two reasons.
- synthpop’s design makes it very difficult to extend. It is especially difficult to extend for developers who were not involved in the original creation of synthpop.
- synthpop does not have much of a GitHub presence. Until recently, the only repository on GitHub was a CRAN mirror. Since 2014, multiple bootstrapped and modified versions of synthpop have circulated without much progress toward centralizing its open-source development.