Git and GitHub with Confidential Data

How to Use Git and GitHub with Confidential Data on the Y Drive (Ares Drive)

So you’ve learned about the power of Git and GitHub and now want to use it in your projects. That’s great news! But if your project data is stored on the Y Drive for confidentiality reasons, you’ll have to jump through a few extra hoops. Below are instructions on one way to set up a GitHub repo to play nicely with the Y Drive.

Important

In some cases, using Git and GitHub with confidential data is altogether not allowed due to strict data use agreements or other contractual requirements. If you aren’t sure if you can use Git and GitHub with your data, please reach out to Urban’s security team and/or ask in the #github Slack channel.

1 The big picture

  • Store all data directly on the Y Drive
  • Store all scripts in a Git repo
  • Create and clone the repo on the virtual machine (e.g. Urban Users PGP or Win 10 Secure) local drive
  • Keep that repo with the scripts completely separate from the Y Drive folder with the data
  • Within your scripts, always read in data using the direct Y Drive filepaths and write out intermediate/final data files using direct Y Drive filepaths

This means your scripts will live in a separate place (usually the Documents folder of the virtual desktop) from your data (somewhere in the Y Drive) and only your scripts will be on Git/GitHub. So while changes to your scripts will be tracked with all the power of GitHub, your data will not.

If your project only has non-sensitive data files, it is best practice to put the data and scripts directly in the same Git repo as that makes it easier to work with collaborators and quickly understand project structures. But if you work with confidential data, you can’t put your data in Git/GitHub for security reasons (and therefore won’t get the power of Git and file tracking for the data).

Now that’s out of the way, how do you setup Git/GitHub with your confidential data?

2 Set up the Y Drive folder (for data)

  1. Use the intake form to request a new folder for your project and give the appropriate staff access.

  2. Securely move the project data files into the assigned Y Drive folder. Inside the folder, we recommend the following folder structure:

    - projectfolder/
        - data/
            - raw-data/
            - intermediate-data/
            - final-data/

The data you move to the Y Drive should go into the data/raw-data folder and manipulated data can go into the other folders.

Note that because the data files aren’t in the Git repo, it will be possible for members of your team to accidentally overwrite the intermediate/final data files in the Y Drive if multiple people run the same script. For this reason, we suggest a) keeping the data files inside raw-data untouched and never programmatically writing to that folder and b) keeping a clear script copy of how you transform the files inside raw-data so that you can easily regenerate intermediate and final data files.

3 Set up the Git repo (for scripts)

  1. Use a virtual desktop to open Win10 Secure or PGP. Note that Win10 Secure has SAS installed while PGP does not. PGP might be more powerful/have more memory if you have computationally expensive code. For additional information, see Urban’s Overview of Computing Resources.

  2. If necessary, install GitHub Desktop from this link: https://desktop.github.com/.

  3. Open up GitHub Desktop and login to your GitHub account.

  4. Click Add > Create new repository.

  5. We recommend initializing the repo with a README (with background information/instructions for running your code) and a .gitignore file to tell Git not to track certain files. Here is an example of a strict .gitignore file that ignores all files (e.g. data files) except for R, SAS, and Stata scripts.

    # Ignore everything
    *.*
    
    # Include .gitignore and README.md
    !.gitignore
    !README.md
    
    # Include R scripts 
    !*.R
    !*.Rproj
    
    # Include SAS scripts 
    !*.sas
    
    # Include Stata scripts 
    !*.do
    
  6. Very importantly, when selecting a local path, make sure that the repo is created within the local drive of the virtual machine you are on, and NOT the mapped Y: drive. By default, GitHub Desktop should try to create the repo in \\Ares\CTX_RedirectedFolders$\username\My Documents\GitHub. While that may look weird, that is an acceptable filepath and will put the repo in the My Documents\GitHub folder of the virtual desktop. Once you’re ready, select Create Repository.

  7. This should create the repository in your selected location on the Windows Explorer. If you need help finding the exact location on your computer, go to GitHub Desktop, select Repository in the top left and then Show in File Explorer.

  8. Click Publish Repo in the top right of GitHub Desktop to send the repo to GitHub.com

  9. Inside the Git repo (i.e., folder) we suggest the following folder structure:

    README.md
    scripts/
        somescript.R
        someotherscript.R

    Where the scripts folder contains all code for the project. Feel free to create subfolders as you see fit.

  10. Make changes to your project, like updating the README, or adding scripts. Then commit the changes on GitHub Desktop and push the changes.

  11. Each member of your project team will need to clone the repo into on their respective virtual machines (Urban Users PGP or Win 10 Secure). Again, do NOT clone the repo into the mapped Y Drive.

4 Write scripts and commit to Git/GitHub

Now you can write scripts as you normally would. The only caveat is that all data you read in/write out will have to be to the mapped Y Drive path. Below is an example R script that shows how to do this. Note that these filepaths will only work when the scripts are run from a virtual machine that has Urban network access to the Y Drive (i.e., the two virtual machines specified above).

library(tidyverse)

# --- Read in Data ---

## We recommend using the full filepath which starts with
## `//ares/UI_Projects2/CENTER` as this will never change. Note you will need to
## replace CENTER and projectfolder with your respective values.
raw_data <- read_csv("//ares/UI_Projects2/CENTER/projectfolder/data/raw-data/ex.csv")

## Another acceptable filepath uses the "Y:" drive mapping built into 
## virtual desktops. This could change in the future.
raw_data <- read_csv("Y:/CENTER/projectfolder/data/raw-data/ex.csv")


# --- Write out Data ---
raw_data %>%
  write_csv("//ares/UI_Projects2/CENTER/projectfolder/data/intermediate-data/ex-cleaned.csv")