4  Data Import

Author

R Users Group

Published

September 25, 2024

4.1 Review

  • What is tidy data?

    Artwork by @allison_horst

  • What is a file path?

  • What is an R Studio Project?

  • Stay DRY: if you find yourself copying and pasting chunks of code, write a function instead.

  • Writing functions that use capabilities of the tidyverse can be tricky due to something called Nonstandard Evaluation - feel free to ask any of us for help!

4.2 Motivation

  • diamonds, storms, or starwars are nice examples, but how do I load data for an actual research project?

  • Data import is a crucial yet often overlooked step of truly replicable and reproducible research.

4.3 Packages

We’ll be using three packages from the tidyverse for data import and export. These three should cover just about all of your use cases at Urban.

  • readr - for plain text files (.csv, .tsv)
  • readxl - for Excel files (.xlsx, .xls)
  • haven - for SAS (.sas7bdat + .sas7bcat files), SPSS (.sav files), and Stata (.dta files up to Stata version 17) formats

One note - while all three packages are installed with the tidyverse, only readr is loaded when you run library(tidyverse) in your R session. The other packages must be loaded separately.

4.4 Primary Functions

In general, readr, readxl, and haven will come with some variant of a read function for data import, and a variant of write (except for Excel and SAS!) for data export.

  • readr::read_csv(), readr::write_csv()
  • readr::read_delim(), readr::write_delim()
  • readxl::read_excel()
  • haven::read_sas()
  • haven::read_dta(), haven::write_dta

For the simplest use cases, the various read functions will need just one argument: the path to the file you want to load.

df <- read_csv("path/to/your/original/data.csv")

The various write functions will need two arguments: the data frame you want to export, and the file path to save it to.

write_csv(df, "path/to/your/modified/data.csv")

4.5 Setup

4.5.1 Exercise

Step 1: Open up your RStudio project, open a R script, save it with a meaningful name, and load the tidyverse, readxl, haven, and arrow.

Step 2: Use the dir.create() function to create subfolders called data and modified in your R Project.

Step 3: We’ll be downloading a couple different datasets for today’s examples. Staying DRY, let’s rework the example code from Wednesday into a function we can reuse. Copy and paste the following into your file. Make sure to run the code so the function becomes a part of your environment.

download_data <- function(url, path) {
  if (!file.exists(path)) {
    download.file(url, path, mode = "wb")
  }
}

4.6 The readr Package

readr provides functions to read plain text rectangular data - think csv, txt or tsv files.

4.6.1 CSV Files

4.6.1.1 Example

First we’ll download an example csv file from the Urban Institute’s Education data portal:

# Download ipeds data into data/ folder
download_data("https://educationdata.urban.org/csv/ipeds/colleges_ipeds_completers.csv",
              "data/colleges_ipeds_completers.csv")

And then use read_csv from the readr package to load the data in.

ipeds <- read_csv("data/colleges_ipeds_completers.csv")

We now have a tibble that can be used with all the ggplot2 and dplyr functions you’ve learned so far. Let’s filter the data to only 2011 and write it out as a separate file:

ipeds_2011 <- ipeds %>%
  filter(year == 2011)

write_csv(ipeds_2011, "modified/colleges_ipeds_completers_2011.csv")

4.6.1.2 Exercise

Step 1: Filter the ipeds data frame to years 2014-2015 for the state of California (HINT: fips code of 6). Be sure to use <- to save it to a new object!

Step 2: Write the filtered data frame to a file called “ipeds_completers_ca.csv” in your modified folder.

4.6.2 Other Delimiters

Most plain text files will use a comma to separate values. Sometimes you’ll see other delimiters used - the most common are tabs \t and vertical bar | (which is often called a pipe, not to be confused with the tidyverse pipe %>%).

To read in a plain text file that uses another delimiter, use the read_delim function. This function will take two arguments: the path to the file you want to load and the character to use as a delimiter.

4.6.2.1 Example

First we’ll download a tab delimited file from the North Carolina Board of Elections:

download_data("https://s3.amazonaws.com/dl.ncsbe.gov/data/ncvhis30.zip",
              "data/ncvhis30.zip")

Then load in the file with the correct delimiter set. It may be helpful to open delimtied file with a program like Notepad so you can see the plain text in the file and decide what delimiter to use. In this case we know its a tab:

ncvhis <- read_delim("data/ncvhis30.zip", delim = "\t")

Note that readr was able to read the data directly from the zipfile!

Similarly, the write_delim() function can be used to write out a plain text file with a specified delim argument.

4.6.2.2 Exercise

Step 1: Use write_delim() to write out ncvhis to a pipe | separated file.

4.6.3 Common Issues

Not all data import will be quick and easy. Often times you will run into issues - these can be daunting at first but are often not as bad as they seem.

4.6.3.1 Example

The readr package provides a challenge.csv file that will throw numerous parsing issues.

challenge <- read_csv(readr_example("challenge.csv"))

You can use the problems function to get a tibble of these issues.

issues <- problems(challenge)

A thousand parsing failures sounds like a lot! But it’s really just one error a thousand times, and not a thousand unique errors to work through.

The issue is that readr will read the first thousand lines of a file and use that as the column specification. This file changes data types at line 1001. This is a very common issue when reading text files into R. We can use the guess_max argument to increase the number of lines readr will use to set the column types.

challenge <- read_csv(readr_example("challenge.csv"),
                      guess_max = Inf)

You will run into other problems when importing data like:

  • the columns not being in the first row
  • weird NA values
  • blank rows at the top of files

Luckily read_csv() and its friends have a lot of built in arguments to help deal with those issues. Look at the arguments in the docs using ?read_csv or go to the R4DS data import page for examples of how to deal with all the above problems.

4.6.4 Exercise

Step 1: Pull up the documentation for read_csv and take a look of some of the function arguments. There are a lot! You don’t need to know how everything works, but it’s good to have an idea of some of the options available to you.

4.7 The readxl Package

For data that is saved as .xls or .xlsx, use the readxl package.

4.7.1 Example

For this example we’ll download data from the HUD FHA Single Family Portfolio Snap Shot.

download_data("https://www.hud.gov/sites/dfiles/Housing/documents/FHA_SFSnapshot_Apr2019.xlsx",
              "data/sfsnap.xlsx")

This Excel file contains a number of tables on different sheets of the workbook. We can see a listing of the sheets using the excel_sheets function. As usual, the first argument is the path to the data file.

excel_sheets("data/sfsnap.xlsx")

We can then use the read_excel function to load our data into our R session. Again, the first argument is the path to the .xlsx file. We also use the sheet argument to specify which sheet of the workbook we want.

purchases <- read_excel("data/sfsnap.xlsx", sheet = "Purchase Data April 2019")

Note that the readxl package only provides functionality to, as the name implies, read Excel files. If you need to write your data out to .xls or .xlsx formats - don’t! We highly recommend saving data files as .csv files wherever possible. But if you you really need to, look into the writexl or openxlsx packages.

4.7.2 Exercise

Step 1: Use read_excel() to load in the table on the “Refinance Data April 2019” sheet into a data frame called refinances.

4.8 The haven Package

Finally, your SAS, SPSS, or Stata data can find safe haven in R.

4.8.1 Stata Data (.dta)

4.8.1.1 Example

First we’ll download and unzip the 2018 General Social Survey, which comes as a zipped .dta file.

download_data("http://gss.norc.org/Documents/stata/2018_stata.zip",
              "data/gss.zip")

unzip("data/gss.zip", exdir = "data/")

To read a .dta file, we’ll use the aptly named read_dta function. As usual, the first argument to the function is the path to the data we want to import.

gss <- read_dta("data/GSS2018.dta")

# We can also read in the zip file directly!
gss <- read_dta("data/gss.zip")

4.8.2 SAS Data (.sas7bdat)

We use read_sas to read in SAS files. This works mostly identical to read_stata, and you can read in .sas7bdat + .sas7bcat, or zipped folders with sas files. Note that while the write_sas() function exists, it doesn’t appear to actually work at the current version of haven.

4.8.2.1 Example

Download and read in the 2017 1-year Public Use Microsample for Wyoming from the Census:

download_data("https://www2.census.gov/programs-surveys/acs/data/pums/2017/1-Year/unix_pwy.zip",
              "data/pums_wy.zip")

# Unzip the files
unzip("data/pums_wy.zip", exdir = "data/")

# Read in sas file
pums_data = read_sas("data/pums_wy.zip")

# Write out as dta file
pums_data %>% write_dta("modified/pums_wy.dta")

4.9 The arrow Package

Sometimes when you save text files, the column types (ie numeric, date, time, text, etc.) get lost and won’t be read back in upon reimporting In this case you may want to save your files as a binary file. One binary file format we suggest is .feather files, which you can read more about. The main benefits of storing data in a binary format is:

  • Column types will be recorded and preserved
  • Importing and exporting will be faster than other methods
  • File sizes will be smaller as binary files often use compression methods

The cons are that:

  • .feather files can only easily be read into R and python currently
  • There are no nice GUI viewers (like Excel/Notepad for CSV data) to actually “see” the data files

So think before exercising this option! If you mainly will be working with your data in R and have large file sizes or complicated column types, .feather files may be a good option.

You can read and write these files with read_feather() and write_feather respectively.

# Write out a feather file to `modified/folder`
write_feather(ipeds_2011, "modified/colleges_ipeds_completers_2011.feather")

# Read in a feather file
ipeds_2011 = read_feather("modified/colleges_ipeds_completers_2011.feather")

4.10 Resources