Chapter 14 - Strings
Load the libraries needed for these exercises.
library(tidyverse)
14.2 - String basics
Problem 1
In code that doesn’t use stringr, you’ll often see paste() and paste0(). What’s the difference between the two functions? What stringr function are they equivalent to? How do the functions differ in their handling of NA?
paste() inserts a space between the strings it combines by default (sep = " "). paste0() uses no separator, so it is roughly equivalent to str_c() from library(stringr). paste() and paste0() convert NA to the text "NA", whereas str_c() returns NA whenever any of its inputs is NA.
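A minimal sketch of the differences described above. The inputs and variable names are made up for illustration.

```r
library(stringr)

# paste() separates its arguments with a space by default; paste0() does not.
with_space <- paste("a", "b")
no_space <- paste0("a", "b")

# str_c() uses no separator by default, like paste0().
stringr_default <- str_c("a", "b")

# paste0() turns NA into the text "NA"; str_c() propagates the missing value.
na_paste0 <- paste0("x", NA)
na_str_c <- str_c("x", NA)

print(c(with_space, no_space, stringr_default, na_paste0, na_str_c))
```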
Problem 2
In your own words, describe the difference between the sep and collapse arguments to str_c().
sep is the string inserted between the input vectors when they are combined element by element. The output has the same length as the longest input vector.
length(str_c("Letter", letters, sep = ": "))
## [1] 26
collapse is the string used to join the elements of the combined result into a single string. With collapse set, str_c() always returns a vector of length one.
length(str_c("Letter", letters, collapse = ": "))
## [1] 1
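Looking at the strings themselves, not only their lengths, may make the distinction clearer. This is a small sketch with a made-up three-element vector.

```r
library(stringr)

x <- c("a", "b", "c")

# sep is placed between the inputs, element by element: three strings come back.
with_sep <- str_c("Letter", x, sep = ": ")

# collapse then joins those three results into one string.
with_collapse <- str_c("Letter", x, sep = ": ", collapse = "; ")

print(with_sep)
print(with_collapse)
```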
Problem 3
Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?
string_middle <- function(string) {
  string_length <- str_length(string)
  if (string_length %% 2 == 1) {
    # Odd length: the middle character sits at position (n + 1) / 2.
    middle <- (string_length + 1) / 2
    str_sub(string, middle, middle)
  } else {
    # Even length: there is no single middle character.
    NULL
  }
}
string_middle("abc")
## [1] "b"
string_middle("abcd")
## NULL
The function returns NULL when string_length is even.
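One alternative, sketched here as an assumption and not as the approach used above, is to return the two middle characters for even-length strings. The helper name string_middle_vec is hypothetical. It avoids if, so it also works on whole vectors.

```r
library(stringr)

# Hypothetical helper: the middle character for odd lengths,
# the two middle characters for even lengths.
string_middle_vec <- function(string) {
  n <- str_length(string)
  # For odd n both positions are (n + 1) / 2; for even n they are n / 2 and n / 2 + 1.
  str_sub(string, ceiling(n / 2), floor(n / 2) + 1)
}

middles <- string_middle_vec(c("abc", "abcd", "a"))
print(middles)
```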
Problem 4
What does str_wrap() do? When might you want to use it?
It implements the Knuth-Plass paragraph wrapping algorithm. It “breaks text paragraphs into lines, of total width - if it is possible - of at most given width”.
graph <- "It implements the Knuth-Plass paragraph wrapping algorithm. It breaks text paragraphs into lines, of total width - if it is possible - of at most given width."
str_wrap(graph, width = 20)
## [1] "It implements\nthe Knuth-Plass\nparagraph wrapping\nalgorithm. It breaks\ntext paragraphs\ninto lines, of total\nwidth - if it is\npossible - of at\nmost given width."
This could be useful for formatting text in HTML and R Markdown output, especially for graphics and sidebars. A custom width is particularly helpful in reproducible documents.
Problem 5
What does str_trim() do? What’s the opposite of str_trim()?
It trims whitespace from the left, right, or both sides of a character string. It is the stringr counterpart of base R's trimws().
str_pad() is the opposite of str_trim(). It adds whitespace to the left, right, or both sides of a character string.
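A short sketch of the two functions side by side, using a made-up input string.

```r
library(stringr)

padded <- "  abc  "

# str_trim() removes whitespace; side defaults to "both".
trimmed_both <- str_trim(padded)
trimmed_left <- str_trim(padded, side = "left")

# str_pad() adds whitespace until the string reaches the requested width.
padded_both <- str_pad("abc", width = 5, side = "both")
padded_left <- str_pad("abc", width = 5, side = "left")

print(c(trimmed_both, trimmed_left, padded_both, padded_left))
```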
Problem 6
Write a function that turns (e.g.) a vector c("a", "b", "c") into a string a, b, and c. Think carefully about what it should do if given a vector of length 0, 1, or 2.
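list_maker <- function(string) {
  n <- length(string)
  if (n > 2) {
    # Comma-separate everything before the last element, then add ", and <last>".
    stringa <- str_c(string[-n], collapse = ", ")
    str_c(stringa, ", and ", string[n])
  } else if (n == 2) {
    # Two elements need no comma.
    str_c(string[1], " and ", string[2])
  } else {
    # Length 0 or 1: return the input unchanged.
    string
  }
}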
string <- c("a", "b", "c", "d", "e")
list_maker(string)
## [1] "a, b, c, d, and e"
14.3 - Matching patterns with regular expressions
Problem 1
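Explain why each of these strings don’t match a \: "\", "\\", "\\\".
- "\" escapes the closing quotation mark, so the string is never terminated and isn’t a valid character string in R.
- "\\" is a valid string holding a single backslash, but as a regular expression a lone backslash is an incomplete escape, so it raises an error instead of matching.
- "\\\" is an escaped backslash followed by an escaped quotation mark, so again the string is never terminated.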
Problem 2
How would you match the sequence "'\?
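No answer is recorded for this problem. One possible approach, sketched under the assumption that the target sequence is a double quote, a single quote, and a backslash, is shown below. The names target and pattern are illustrative.

```r
library(stringr)

# Assumed target: double quote, single quote, backslash (three characters).
target <- "\"'\\"

# Only the backslash needs escaping in the regex (\\), and each regex
# backslash is doubled again inside the R string.
pattern <- "\"'\\\\"

# The second test string lacks the trailing backslash.
matches <- str_detect(c(target, "\"'"), pattern)
print(matches)
```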
Problem 3
What patterns will the regular expression \..\..\.. match? How would you represent it as a string?
It matches a literal period followed by any single character, three times in a row, for example ".a.b.c". As a string it is written "\\..\\..\\..".
str_view(".a.b.c", "\\..\\..\\..")
14.3.2 - Anchors
Problem 1
How would you match the literal string "$^$"?
x <- "$^$"
str_view(x, "\\$\\^\\$")
Problem 2
Given the corpus of common words in stringr::words, create regular expressions that will find all words that:
- Start with “y”
str_view(words, "^y", match = TRUE)
- End with “x”
table(str_detect(words, "x$"))
##
## FALSE TRUE
## 976 4
- Are exactly three letters long. (Don’t cheat by using str_length()!)
table(str_detect(words, "^...$"))
##
## FALSE TRUE
## 870 110
- Have seven letters or more.
table(str_detect(words, "^......."))
##
## FALSE TRUE
## 761 219
14.3.3 - Character classes and alternatives
Problem 1
Create regular expressions to find all words that:
- Start with a vowel
str_view(words[1:10], "^[aeiou]", match = TRUE)
- Only contain consonants. (Hint: think about match “not”-vowels.)
str_view(words, "^[^aeiou]+$", match = TRUE)
I’m not sure if this can be done without +, which is introduced on page 204 after the exercises.
- End with ed, but not with eed.
str_view(words, "[^e]ed$", match = TRUE)
- End with ing or ize.
str_view(words, "ing$|ize$", match = TRUE)
Problem 2
Empirically verify the rule “i before e except after c.”
Let’s look for counterexamples. Two patterns break the rule:
- ie after c
- ei not after c
str_view(words, "cie|(^|[^c])ei", match = TRUE)
Any word this returns violates the rule. “i before e except after c” is and always will be rubbish.
Problem 3
Is “q” always followed by a “u”?
Proof by contradiction: look for a “q” not followed by a “u”.
str_view(words, "q[^u]", match = TRUE)
Yes, “q” is always followed by a “u” in this data set.
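The position of ^ matters here: inside brackets it negates the class, while outside brackets it is an anchor. The sketch below uses made-up test strings, not stringr::words, to show the difference, plus a variant that also catches a trailing q.

```r
library(stringr)

x <- c("quick", "qatar", "iraq")  # made-up test strings

# ^ outside the brackets is a start-of-string anchor, so nothing can match after "q".
anchor_misplaced <- str_detect(x, "q^[u]")

# [^u] means "any character except u".
negated_class <- str_detect(x, "q[^u]")

# Also treat a "q" at the very end of the string as "not followed by u".
with_trailing_q <- str_detect(x, "q([^u]|$)")

print(rbind(anchor_misplaced, negated_class, with_trailing_q))
```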
Problem 4
Write a regular expression that matches a word if it’s probably written in British English, not American English.
str_view(words, "our|ise|ogue", match = TRUE)
Problem 5
Create a regular expression that will match telephone numbers as commonly written in your country.
phone <- c("212-555-7891", "(212)-555-7891")
str_view(phone, "\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d|\\(\\d\\d\\d\\)-\\d\\d\\d-\\d\\d\\d\\d", match = TRUE)
14.3.4 - Repetition
Problem 1
Describe the equivalents of ?, +, and * in {n, m} form.
? == {0,1}
+ == {1,}
* == {0,}
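These equivalences can be checked empirically. The sketch below compares each shorthand with its {n,m} form on a few made-up strings, and also shows that {1} is not the same as ?.

```r
library(stringr)

x <- c("color", "colour", "colouur", "colr")  # made-up test strings

# Each shorthand should behave exactly like its {n,m} form.
optional_same <- identical(str_detect(x, "colou?r"), str_detect(x, "colou{0,1}r"))
one_plus_same <- identical(str_detect(x, "colou+r"), str_detect(x, "colou{1,}r"))
zero_plus_same <- identical(str_detect(x, "colou*r"), str_detect(x, "colou{0,}r"))

# {1} requires exactly one "u", so it rejects "color" while ? accepts it.
exactly_one <- str_detect(x, "colou{1}r")
optional <- str_detect(x, "colou?r")

print(c(optional_same, one_plus_same, zero_plus_same))
```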
Problem 2
Describe in words what these regular expressions match (read carefully to see if I’m using a regular expression or a string that defines a regular expression):
- ^.*$ matches an entire string. ^ matches the start of the string, . is any character, repeated 0 or more times by *, and $ matches the end of the string.
- "\\{.+\\}" matches an opening curly brace, followed by one or more characters, followed by a closing curly brace.
- \d{4}-\d{2}-\d{2} matches exactly four digits followed by a dash, exactly two digits, another dash, and exactly two more digits. This is the layout of an ISO 8601 calendar date (YYYY-MM-DD), although the pattern does not check that the digits form a valid date.
- "\\\\{4}" matches exactly four backslashes.
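A quick way to check descriptions like these is to run the patterns against a few hand-picked strings. The test strings below are made up. Note that the bare regular expressions have to be written with doubled backslashes once they become R strings.

```r
library(stringr)

# \d{4}-\d{2}-\d{2} written as an R string.
iso_layout <- "\\d{4}-\\d{2}-\\d{2}"
date_hits <- str_detect(c("2019-03-14", "14/03/2019", "2019-13-99"), iso_layout)

# "\\{.+\\}" needs at least one character between the braces.
brace_hits <- str_detect(c("{a}", "{}"), "\\{.+\\}")

# "\\\\{4}" needs four consecutive backslashes; the second string has only two.
backslash_hits <- str_detect(c("\\\\\\\\", "\\\\"), "\\\\{4}")

print(list(date_hits, brace_hits, backslash_hits))
```

The third date string matches even though it is not a real date, which is why the pattern only describes a layout.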
Problem 3
Create regular expressions to find all words that:
- Start with three consonants.
string <- c("scratch", "apple")
str_view(string, "^[^aeiou]{3}")
- Have three or more vowels in a row.
string <- c("scratch", "aaapple")
str_view(string, "[aeiou]{3,}")
- Have two or more vowel consonant pairs in a row.
string <- c("banana", "coconut")
str_view(string, "([aeiou][^aeiou]){2,}")
Problem 4
Solve the beginner regexp crosswords at http://regexcrossword.com/challenges/beginner
14.3.5 - Grouping and backreferences
Problem 1
Describe in words what these expressions will match:
- (.)\1\1 will match any letter or symbol repeated three times in a row (ex. “aaa”).
- "(.)(.)\\2\\1" will match a four-letter palindrome, spelled the same forwards and backwards (ex. “abba”).
- (..)\1 will match a four-letter string where the second half is a repetition of the first half (ex. “abab”).
- "(.).\\1.\\1" will match the same character three times, where each occurrence is separated by one character (ex. “ababa” and “&&&&&”).
- "(.)(.)(.)*\\3\\2\\1" will match a string of characters where the first three characters are repeated in reverse and the middle character can be repeated multiple times (ex. “abccba” and “abcccccba”).
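As a rough check of the first three descriptions, the sketch below runs the patterns against three made-up strings. Each pattern should pick out exactly one of them.

```r
library(stringr)

x <- c("abba", "abab", "aaa")  # made-up test strings

# (.)\1\1: one character three times in a row.
triple <- str_detect(x, "(.)\\1\\1")

# (.)(.)\2\1: two characters followed by the same two in reverse order.
palindrome <- str_detect(x, "(.)(.)\\2\\1")

# (..)\1: a pair of characters immediately repeated.
repeated_pair <- str_detect(x, "(..)\\1")

print(rbind(triple, palindrome, repeated_pair))
```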
Problem 2
Construct regular expressions to match words that:
- Start and end with the same character.
"^(.).*\\1$"
- Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice)
".*(..).*\\1.*"
- Contain one letter repeated in at least three places (e.g., “eleven” contains three “e”s).
".*(.).*\\1.*\\1.*"
14.4 - Tools
Problem 1
For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls:
- Find all words that start or end with x.
str_detect(words, "^x|x$")
str_detect(words, "^x") | str_detect(words, "x$")
- Find all words that start with a vowel and end with a consonant.
str_detect(words, "^[aeiou].*[^aeiou]$")
str_detect(words, "^[aeiou]") & str_detect(words, "[^aeiou]$")
- Are there any words that contain at least one of each different vowel?
TODO(aaron): hmm?
- What word has the highest number of vowels? What word has the highest proportion of vowels? (Hint: what is the denominator?)
as_tibble(words) %>%
mutate(vowels = str_count(value, "[aeiou]")) %>%
filter(vowels == max(vowels))
## # A tibble: 8 x 2
## value vowels
## <chr> <int>
## 1 appropriate 5
## 2 associate 5
## 3 available 5
## 4 colleague 5
## 5 encourage 5
## 6 experience 5
## 7 individual 5
## 8 television 5
as_tibble(words) %>%
mutate(letters = str_count(value),
vowels = str_count(value, "[aeiou]"),
proportion = vowels / letters) %>%
filter(proportion == max(proportion))
## # A tibble: 1 x 4
## value letters vowels proportion
## <chr> <int> <int> <dbl>
## 1 a 1 1 1
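The question about words containing every vowel is left open above. One possible approach with multiple str_detect() calls is sketched below on a made-up vector. The same expression could be applied to words, but that result is not shown here.

```r
library(stringr)

x <- c("education", "banana", "sequoia")  # made-up test strings

# One str_detect() call per vowel, combined with &.
has_all_vowels <- str_detect(x, "a") &
  str_detect(x, "e") &
  str_detect(x, "i") &
  str_detect(x, "o") &
  str_detect(x, "u")

print(x[has_all_vowels])
```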
14.4.3 - Extract Matches
Problem 1
In the previous example, you might have noticed that the regular expression matched “flickered”, which is not a color. Modify the regex to fix the problem.
colors <- "\\b(red|orange|yellow|green|blue|purple)\\b"
more <- sentences[str_count(sentences, colors) > 1]
str_view_all(more, colors)
Problem 2
From the Harvard sentences data, extract:
- The first word from each sentence.
str_extract(sentences, "[^\\s]*")
- All words ending in ing.
str_extract_all(sentences, "\\b[^\\s]*ing\\b")
- All plurals.
TODO(aaron): hmm?
14.4.4 - Grouped matches
Problem 1
Find all words that come after a “number” like “one”, “two”, “three”, etc. Pull out the number and the word.
numbers <- "([Oo]ne|[Tt]wo|[Tt]hree|[Ff]our|[Ff]ive|[Ss]ix|[Ss]even|[Ee]ight|[Nn]ine|[Tt]en) ([^ ]+)"
tibble(sentence = sentences) %>%
extract(
sentence, c("number", "word"), numbers,
remove = FALSE
) %>%
filter(!is.na(number))
## # A tibble: 46 x 3
## sentence number word
## <chr> <chr> <chr>
## 1 Rice is often served in round bowls. ten served
## 2 Four hours of steady work faced us. Four hours
## 3 Two blue fish swam in the tank. Two blue
## 4 Lift the square stone over the fence. one over
## 5 The rope will bind the seven books at once. seven books
## 6 The two met while playing on the sand. two met
## 7 There are more than two factors here. two factors
## 8 He lay prone and hardly moved a limb. one and
## 9 Ten pins were set in order. Ten pins
## 10 Type out three lists of orders. three lists
## # ... with 36 more rows
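Some rows in the output above are false positives: “often served”, “stone over”, and “prone and” match because the number is found inside a longer word. A possible refinement, sketched here with a shortened number list and two of the sentences shown above, is to require a word boundary (\\b) before the number. The names loose and strict are illustrative.

```r
library(stringr)

# Shortened versions of the pattern, with and without a leading word boundary.
loose <- "(one|two|three|ten) ([^ ]+)"
strict <- "\\b(one|two|three|ten) ([^ ]+)"

x <- c(
  "Rice is often served in round bowls.",
  "The two met while playing on the sand."
)

# Without \\b, "ten" is found inside "often".
loose_hits <- str_extract(x, loose)

# With \\b, only the standalone number matches.
strict_hits <- str_extract(x, strict)

print(loose_hits)
print(strict_hits)
```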
Problem 2
Find all contractions. Separate out the pieces before and after the apostrophe.
"[^ ]*'[^ ]*"
could be used, but it also returns possessive nouns. The following alternation of patterns gets around this problem.
contractions <- "[^ ]*'m|[^ ]*n't|[^ ]*'ve|[^ ]*'d|[^ ]*'re|[^ ]*'ll|[Ll]et's|[Ss]he's|[Hh]e's"
tibble(sentence = sentences) %>%
mutate(contraction = str_extract(sentence, contractions)) %>%
filter(!is.na(contraction)) %>%
extract(contraction, c("before", "apostrophe", "after"), "(.*)(')(.*)")
## # A tibble: 4 x 4
## sentence before apostrophe after
## <chr> <chr> <chr> <chr>
## 1 Open the crate but don't break the glass. don ' t
## 2 Let's all join as we sing the last chorus. Let ' s
## 3 We don't get much money but we have fun. don ' t
## 4 We don't like to admit our small faults. don ' t
14.4.5 - Replacing matches
Problem 1
Replace all forward slashes in a string with backslashes.
str_replace_all("a/b/c", "/", "\\\\")
Problem 2
Implement a simple version of str_to_lower() using str_replace_all().
str_replace_all("AbC", "[A-Z]", tolower)
Problem 3
Switch the first and last letters in words. Which of those strings are still words?
new.words <- str_replace(words, "(^.)(.*)(.$)", "\\3\\2\\1")
words[new.words %in% words]
## [1] "a" "america" "area" "dad" "dead"
## [6] "deal" "dear" "depend" "dog" "educate"
## [11] "else" "encourage" "engine" "europe" "evidence"
## [16] "example" "excuse" "exercise" "expense" "experience"
## [21] "eye" "god" "health" "high" "knock"
## [26] "lead" "level" "local" "nation" "no"
## [31] "non" "on" "rather" "read" "refer"
## [36] "remember" "serious" "stairs" "test" "tonight"
## [41] "transport" "treat" "trust" "window" "yesterday"
14.4.6 - Splitting
Problem 1
Split up a string like “apples, pears, and bananas” into individual components.
str_split("apples, pears, and bananas", boundary("word"))
Problem 2
Why is it better to split up by boundary("word") than " "?
Splitting on " " leaves punctuation attached to the pieces (for example "apples,") and returns empty strings wherever spaces repeat, while boundary("word") returns only the words.
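The difference is easy to see on the example string from Problem 1. This is a small sketch; the variable names are illustrative.

```r
library(stringr)

x <- "apples, pears, and bananas"

# Splitting on a single space keeps the commas attached to the pieces.
by_space <- str_split(x, " ")[[1]]

# Splitting on word boundaries returns the words only.
by_word <- str_split(x, boundary("word"))[[1]]

print(by_space)
print(by_word)
```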
Problem 3
What does splitting with an empty string ("") do? Experiment and read the documentation.
It splits the string into individual characters. From the documentation: “An empty pattern, "", is equivalent to boundary("character").”
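A quick experiment along the lines the question suggests, using a made-up input string.

```r
library(stringr)

# An empty pattern splits the string into its individual characters.
by_empty <- str_split("abc", "")[[1]]

# boundary("character") is documented as the equivalent pattern.
by_character <- str_split("abc", boundary("character"))[[1]]

print(by_empty)
print(identical(by_empty, by_character))
```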
14.5 - Other types of pattern
Problem 1
How would you find all strings containing \ with regex() vs. with fixed()?
regex("\\\\") and fixed("\\")
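The two forms can be compared directly with str_detect(). The sketch below uses two made-up strings, only the first of which contains a backslash.

```r
library(stringr)

x <- c("a\\b", "ab")  # the first string is a, backslash, b

# regex(): the backslash is escaped for the regex and again for the R string.
with_regex <- str_detect(x, regex("\\\\"))

# fixed(): the pattern is taken literally, so only the R string escape is needed.
with_fixed <- str_detect(x, fixed("\\"))

print(rbind(with_regex, with_fixed))
```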
Problem 2
What are the five most common words in sentences?
The five most common words are “the”, “a”, “of”, “to”, and “and”.
str_split(sentences, boundary("word")) %>%
flatten_chr() %>%
str_to_lower() %>%
as_tibble() %>%
group_by(value) %>%
count() %>%
arrange(-n) %>%
ungroup() %>%
top_n(5)
## Selecting by n
## # A tibble: 5 x 2
## value n
## <chr> <int>
## 1 the 751
## 2 a 202
## 3 of 132
## 4 to 123
## 5 and 118
14.6 - Other uses of regular expressions
Problem 1
Find the stringi functions that:
- Count the number of words.
stri_count_words()
- Find duplicated strings.
stri_duplicated()
- Generate random text.
stri_rand_strings() generates random strings; stri_rand_lipsum() generates lorem ipsum text.
Problem 2
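How do you control the language that stri_sort() uses for sorting?
With the locale argument, passed either directly or through opts_collator = stri_opts_collator(locale = ...). stringr's str_sort() takes a locale argument directly.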
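A minimal sketch of setting the locale, using a made-up vector. The "en" results are ordinary alphabetical order. Other locales, such as "haw" for Hawaiian, may order the same strings differently, so no output is claimed for them here.

```r
library(stringr)
library(stringi)

x <- c("banana", "eggplant", "apple")

# stringr: locale is a direct argument of str_sort().
sorted_stringr <- str_sort(x, locale = "en")

# stringi: locale can be passed directly or via a collator options list.
sorted_direct <- stri_sort(x, locale = "en")
sorted_opts <- stri_sort(x, opts_collator = stri_opts_collator(locale = "en"))

# A different locale is selected the same way.
sorted_haw <- str_sort(x, locale = "haw")

print(sorted_stringr)
```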