Chapter 11 - Data Import
Load the libraries needed for these exercises.
library(tidyverse)
11.2 - Getting Started
Problem 1
What function would you use to read a file where fields were separated with “|”?
Use read_delim()
, using |
as the delimiter:
data <- 'a|b|c\n1|2|3'
read_delim(data, delim = '|')
## # A tibble: 1 x 3
## a b c
## <int> <int> <int>
## 1 1 2 3
Problem 2
Apart from file
, skip
, and comment
, what other arguments do read_csv()
and read_tsv()
have in common?
read_csv()
and read_tsv()
are essentially just read_delim
with the delimiter preset to either a comma or a tab. All of their other arguments are the same.
Problem 3
What are the most important arguments to read_fwf()
?
The most important argument to read_fwf()
is col_positions
, as this determines how data is read:
fwf_sample <- readr_example("fwf-sample.txt")
cat(read_lines(fwf_sample))
## John Smith WA 418-Y11-4111 Mary Hartford CA 319-Z19-4341 Evan Nolan IL 219-532-c301
read_fwf(fwf_sample, fwf_widths(c(20, 10, 12), c("name", "state", "ssn")))
## Parsed with column specification:
## cols(
## name = col_character(),
## state = col_character(),
## ssn = col_character()
## )
## # A tibble: 3 x 3
## name state ssn
## <chr> <chr> <chr>
## 1 John Smith WA 418-Y11-4111
## 2 Mary Hartford CA 319-Z19-4341
## 3 Evan Nolan IL 219-532-c301
read_fwf(fwf_sample, fwf_widths(c(5, 10, 12), c("name", "state", "ssn")))
## Parsed with column specification:
## cols(
## name = col_character(),
## state = col_character(),
## ssn = col_character()
## )
## # A tibble: 3 x 3
## name state ssn
## <chr> <chr> <chr>
## 1 John Smith WA
## 2 Mary Hartford CA
## 3 Evan Nolan IL
Problem 4
Sometimes strings in a CSV file contain commas. To prevent them from causing problems they need to be surrounded by a quoting character, like "
or '
. By convention, read_csv()
assumes that the quoting character will be "
, and if you want to change it you’ll need to use read_delim()
instead. What arguments do you need to specify to read the following text into a data frame?
"x,y\n1,'a,b'"
Since read_delim()
must be used instead of read_csv()
, the delimiter must be set. The quote
argument can be set to a single quote instead of a double quote:
data <- "x,y\n1,'a,b'"
read_delim(data, delim = ',', quote = '\'')
## # A tibble: 1 x 2
## x y
## <int> <chr>
## 1 1 a,b
Problem 5
Identify what is wrong with each of the following inline CSV files. What happens when you run the code?
- There are more data than columns, which results in a parsing failure. The extra data are dropped from the data frame:
read_csv("a,b\n1,2,3\n4,5,6")
## Warning: 2 parsing failures.
## row # A tibble: 2 x 5 col row col expected actual file expected <int> <chr> <chr> <chr> <chr> actual 1 1 <NA> 2 columns 3 columns literal data file 2 2 <NA> 2 columns 3 columns literal data
## # A tibble: 2 x 2
## a b
## <int> <int>
## 1 1 2
## 2 4 5
- There is too little data in row 2 and too much data in row 3 - row 2 is filled in with a missing value while row 3 drops data:
read_csv("a,b,c\n1,2\n1,2,3,4")
## Warning: 2 parsing failures.
## row # A tibble: 2 x 5 col row col expected actual file expected <int> <chr> <chr> <chr> <chr> actual 1 1 <NA> 3 columns 2 columns literal data file 2 2 <NA> 3 columns 4 columns literal data
## # A tibble: 2 x 3
## a b c
## <int> <int> <int>
## 1 1 2 NA
## 2 1 2 3
- There are two variables but only one data point -
b
is filled in with a missing value:
read_csv("a,b\n\"1")
## Warning: 2 parsing failures.
## row # A tibble: 2 x 5 col row col expected actual file expected <int> <chr> <chr> <chr> <chr> actual 1 1 a closing quote at end of file "" literal data file 2 1 <NA> 2 columns 1 columns literal data
## # A tibble: 1 x 2
## a b
## <int> <chr>
## 1 1 <NA>
- Appears that the header was entered twice, so the data are parsed as character instead of a string. Or if the goal here was to enter a missing value
NA
, note that then
was processed as a new line\n
.
read_csv("a,b\n1,2\na,b")
## # A tibble: 2 x 2
## a b
## <chr> <chr>
## 1 1 2
## 2 a b
read_csv("a,b\n1,2\nna,b")
## # A tibble: 2 x 2
## a b
## <chr> <chr>
## 1 1 2
## 2 na b
read_csv()
has a delimiter set to,
, useread_csv2()
instead:
read_csv("a;b\n1;3")
## # A tibble: 1 x 1
## `a;b`
## <chr>
## 1 1;3
read_csv2("a;b\n1;3")
## Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.
## # A tibble: 1 x 2
## a b
## <int> <int>
## 1 1 3
11.3 - Parsing a Vector
Problem 1
What are the most important arguments to locale()
?
The date_names
argument provides useful defaults for a locale()
object:
dutch <- locale('nl')
japanese <- locale('ja')
str(dutch)
## List of 7
## $ date_names :List of 5
## ..$ mon : chr [1:12] "januari" "februari" "maart" "april" ...
## ..$ mon_ab: chr [1:12] "jan." "feb." "mrt." "apr." ...
## ..$ day : chr [1:7] "zondag" "maandag" "dinsdag" "woensdag" ...
## ..$ day_ab: chr [1:7] "zo" "ma" "di" "wo" ...
## ..$ am_pm : chr [1:2] "a.m." "p.m."
## ..- attr(*, "class")= chr "date_names"
## $ date_format : chr "%AD"
## $ time_format : chr "%AT"
## $ decimal_mark : chr "."
## $ grouping_mark: chr ","
## $ tz : chr "UTC"
## $ encoding : chr "UTF-8"
## - attr(*, "class")= chr "locale"
str(japanese)
## List of 7
## $ date_names :List of 5
## ..$ mon : chr [1:12] "1月" "2月" "3月" "4月" ...
## ..$ mon_ab: chr [1:12] "1月" "2月" "3月" "4月" ...
## ..$ day : chr [1:7] "日曜日" "月曜日" "火曜日" "水曜日" ...
## ..$ day_ab: chr [1:7] "日" "月" "火" "水" ...
## ..$ am_pm : chr [1:2] "午前" "午後"
## ..- attr(*, "class")= chr "date_names"
## $ date_format : chr "%AD"
## $ time_format : chr "%AT"
## $ decimal_mark : chr "."
## $ grouping_mark: chr ","
## $ tz : chr "UTC"
## $ encoding : chr "UTF-8"
## - attr(*, "class")= chr "locale"
Be sure to read the full documentation for locale()
. Common data import issues can probably be solved with decimal_mark
, grouping_mark
, and/or encoding
.
Problem 2
What happens if you try and set decimal_mark
and grouping_mark
to the same character? What happens to the default value of grouping_mark
when you set decimal_mark
to “,”
? What happens to the default value of decimal_mark
when you set the grouping_mark to “.”
?
locale()
requires that decimal_mark
and grouping_mark
be different:
x <- locale(decimal_mark = '.', grouping_mark = '.')
## Error: `decimal_mark` and `grouping_mark` must be different
Setting decimal_mark
to ,
will automatically update grouping_mark
to .
. Similarly setting grouping_mark
to .
will automatically update decimal_mark
to ,
:
x <- locale(decimal_mark = ',')
x$grouping_mark
## [1] "."
y <- locale(grouping_mark = '.')
y$decimal_mark
## [1] ","
Problem 3
I didn’t discuss the date_format
and time_format
options to locale()
. What do they do? Construct an example that shows when they might be useful.
A specific date_format
and time_format
structure can be specified in a locale()
. This can be useful for formatting data with non-standard formatting:
parse_date('June/7/90', locale = locale(date_format = '%B/%d/%y'))
## [1] "1990-06-07"
parse_time('1:15PM', locale = locale(time_format = '%I:%M%p'))
## 13:15:00
Problem 4
If you live outside the US, create a new locale object that encapsulates the settings for the types of file you read most commonly.
Create a locale with the time zone updated:
str(locale(tz = 'US/Eastern'))
## List of 7
## $ date_names :List of 5
## ..$ mon : chr [1:12] "January" "February" "March" "April" ...
## ..$ mon_ab: chr [1:12] "Jan" "Feb" "Mar" "Apr" ...
## ..$ day : chr [1:7] "Sunday" "Monday" "Tuesday" "Wednesday" ...
## ..$ day_ab: chr [1:7] "Sun" "Mon" "Tue" "Wed" ...
## ..$ am_pm : chr [1:2] "AM" "PM"
## ..- attr(*, "class")= chr "date_names"
## $ date_format : chr "%AD"
## $ time_format : chr "%AT"
## $ decimal_mark : chr "."
## $ grouping_mark: chr ","
## $ tz : chr "US/Eastern"
## $ encoding : chr "UTF-8"
## - attr(*, "class")= chr "locale"
Problem 5
What’s the difference between read_csv()
and read_csv2()
?
read_csv()
has a delimiter set to ,
while read_csv2
is set to ;
, as some countries use ,
as the decimal_mark
:
read_csv('a,b\n1,2')
## # A tibble: 1 x 2
## a b
## <int> <int>
## 1 1 2
read_csv2('a;b\n1;2')
## Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.
## # A tibble: 1 x 2
## a b
## <int> <int>
## 1 1 2
Problem 6
What are the most common encodings used in Europe? What are the most common encodings used in Asia? Do some googling to find out.
See this list of common encodings (via Wikipedia).
Problem 7
Generate the correct format string to parse each of the following dates and times:
d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14" # Dec 30, 2014
t1 <- "1705"
t2 <- "11:15:10.12 PM"
Build up a datetime format using the pieces described in the chapter:
parse_date(d1, format = '%B %d, %Y')
## [1] "2010-01-01"
parse_date(d2, format = '%Y-%b-%d')
## [1] "2015-03-07"
parse_date(d3, format = '%d-%b-%Y')
## [1] "2017-06-06"
parse_date(d4, format = '%B %d (%Y)')
## [1] "2015-08-19" "2015-07-01"
parse_date(d5, format = '%m/%d/%y')
## [1] "2014-12-30"
parse_time(t1, format = '%H%M')
## 17:05:00
parse_time(t2, format = '%I:%M:%OS %p')
## 23:15:10.12