stringr: modern, consistent string processing

String processing is not glamorous, but it is frequently used in data cleaning and preparation. The existing string functions in R are powerful, but not friendly. To remedy this, the stringr package provides string functions that are simpler and more consistent, and also adds some functionality that R is missing compared to other programming languages.

Hadley Wickham (Department of Statistics, Rice University)
2010-12-01

1 Introduction

Strings are not glamorous, high-profile components of R, but they do play a big role in many data cleaning and preparation tasks. R provides a solid set of string operations, but because they have grown organically over time, they can be inconsistent and a little hard to learn. Additionally, they lag behind the string operations in other programming languages, so that some things that are easy to do in languages like Ruby or Python are rather hard to do in R. The stringr package aims to remedy these problems by providing a clean, modern interface to common string operations.

More concretely, stringr:

To meet these goals, stringr provides two basic families of functions:

These are described in more detail in the following sections.

2 Basic string operations

There are three string functions that are closely related to their base R equivalents, but with a few enhancements:

Three functions add new functionality:
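The function lists for this section are not reproduced above, so the following is only a sketch. It assumes the three base-like functions are str_c(), str_length() and str_sub(), and that the three additions are str_dup(), str_trim() and str_pad(). All six exist in stringr, but treating them as the ones meant here is an assumption.

```r
library(stringr)

# Assumed base-like trio (rough base analogues: paste0, nchar, substr)
joined  <- str_c("data", "-", "cleaning")      # "data-cleaning"
n_chars <- str_length("banana")                # 6
head3   <- str_sub("banana", 1, 3)             # "ban"
tail3   <- str_sub("banana", -3, -1)           # "ana"; negative positions count from the end

# Assumed additions
repeated <- str_dup("ab", 3)                   # "ababab"
trimmed  <- str_trim("  329-293-8753  ")       # "329-293-8753"
padded   <- str_pad("7", width = 3, pad = "0") # "007"; pads on the left by default
```

Negative positions in str_sub() are one of the conveniences that substr() does not offer directly.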

3 Pattern matching

stringr provides pattern matching functions to detect, locate, extract, match, replace, and split strings:

Figure 1 shows how the simple (single match) form of each of these functions works.

library(stringr)
strings <- c(" 219 733 8965", "329-293-8753 ", "banana", "595 794 7569", 
 "387 287 6718", "apple", "233.398.9187  ", "482 952 3315", "239 923 8115",
 "842 566 4692", "Work: 579-499-7527", "$1000", "Home: 543.355.3679")
phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"

# Which strings contain phone numbers?
str_detect(strings, phone)
strings[str_detect(strings, phone)]

# Where in the string is the phone number located?
loc <- str_locate(strings, phone)
loc

# Extract just the phone numbers
str_sub(strings, loc[, "start"], loc[, "end"])
# Or more conveniently:
str_extract(strings, phone)

# Pull out the three components of the match
str_match(strings, phone)

# Anonymise the data
str_replace(strings, phone, "XXX-XXX-XXXX")
Figure 1: Simple string matching functions for processing a character vector containing phone numbers (among other things).
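Figure 1 uses the single-match forms. As a rough sketch of how they differ from the corresponding _all forms (which Figure 2 relies on), consider a string containing two phone numbers. The `contact` string is a hypothetical input made up for illustration.

```r
library(stringr)

phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"
contact <- "Home: 543.355.3679, Work: 579-499-7527"  # hypothetical input

# Single-match form: only the first phone number is returned
first <- str_extract(contact, phone)

# _all form: a list with one character vector per input string
all_matches <- str_extract_all(contact, phone)

# Likewise, str_replace() would anonymise only the first number;
# str_replace_all() anonymises every match
masked <- str_replace_all(contact, phone, "XXX-XXX-XXXX")
```

Because the _all forms can return a different number of matches for each input, they return lists rather than vectors.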

Arguments

Each pattern matching function has the same first two arguments, a character vector of strings to process and a single pattern (regular expression) to match. The replace functions have an additional argument specifying the replacement string, and the split functions have an argument to specify the number of pieces.
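A minimal sketch of this shared argument order, using a small hypothetical vector of dates: the strings come first, the pattern second, and any function-specific argument (replacement, number of pieces) follows.

```r
library(stringr)

dates <- c("2010-12-01", "2011-01-15")  # hypothetical input

# Strings first, pattern second
in_2010 <- str_detect(dates, "^2010")

# Replace functions take the replacement as a third argument;
# str_replace() changes only the first match in each string
slashed <- str_replace(dates, "-", "/")

# Split functions take the number of pieces
pieces <- str_split(dates, "-", n = 2)        # list of character vectors
pieces_mat <- str_split_fixed(dates, "-", 2)  # character matrix with 2 columns
```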

Unlike base string functions, stringr only offers limited control over the type of matching. The fixed() and ignore.case() functions modify the pattern to use fixed matching or to ignore case, but if you want to use Perl-style regular expressions or to match on bytes instead of characters, you’re out of luck and you’ll have to use the base string functions. This is a deliberate choice made to simplify these functions. For example, while grepl has six arguments, str_detect only has two.
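The effect of fixed() is easiest to see with a pattern that contains a regular expression metacharacter, as in this sketch. The case-insensitive line assumes a recent stringr release, where regex(..., ignore_case = TRUE) plays the role of the ignore.case() helper named above; that substitution is my assumption about the installed version, not something this article describes.

```r
library(stringr)

prices <- c("$1000", "1000", "USD 1000")  # hypothetical input

# As a regular expression, "$" anchors the end of the string,
# so the pattern "$1000" can never match
regex_hits <- str_detect(prices, "$1000")

# fixed() treats the pattern as a literal string
fixed_hits <- str_detect(prices, fixed("$1000"))

# The base R equivalent needs an extra argument instead of a pattern modifier
base_hits <- grepl("$1000", prices, fixed = TRUE)

# Assumption: recent stringr releases spell case-insensitive matching this way
any_case <- str_detect("Banana", regex("banana", ignore_case = TRUE))
```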

Regular expressions

To be able to use these functions effectively, you’ll need a good knowledge of regular expressions (Friedl 1997), which this paper is not going to teach you. Some useful tools to get you started:

When writing regular expressions, I strongly recommend generating a list of positive (pattern should match) and negative (pattern shouldn’t match) test cases to ensure that you are matching the correct components.
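One way to put this recommendation into practice is sketched below, using the phone pattern from Figure 1. The two test vectors are hypothetical examples chosen for illustration; a real pattern would need cases drawn from the data at hand.

```r
library(stringr)

phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"

# Positive cases: the pattern should match each of these
should_match <- c("219 733 8965", "329-293-8753", "Work: 579-499-7527")

# Negative cases: the pattern should match none of these
# (no digits, too few digits, leading 1, no separators)
should_not_match <- c("banana", "$1000", "119 733 8965", "2197338965")

stopifnot(
  all(str_detect(should_match, phone)),
  !any(str_detect(should_not_match, phone))
)
```

Rerunning such checks after every change to the pattern makes it much harder to break a case that used to work.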

Functions that return lists

Many of the functions return a list of vectors or matrices. To work with each element of the list there are two strategies: iterate through a common set of indices, or use mapply to iterate through the vectors simultaneously. The first approach is usually easier to understand and is illustrated in Figure 2.

library(stringr)
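col2hex <- function(col) {
  # col2rgb() returns a matrix with rows "red", "green" and "blue",
  # and one column per colour
  channels <- col2rgb(col)
  rgb(channels["red", ], channels["green", ], channels["blue", ],
    maxColorValue = 255)
}

# Goal: replace colour names in a string with their hex equivalent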
strings <- c("Roses are red, violets are blue", "My favourite colour is green")

colours <- str_c("\\b", colors(), "\\b", collapse="|")
# This gets us the colours, but we have no way of replacing them
str_extract_all(strings, colours)

# Instead, let's work with locations
locs <- str_locate_all(strings, colours)
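sapply(seq_along(strings), function(i) {
  string <- strings[i]
  loc <- locs[[i]]

  # Convert colours to hex and replace, one match at a time.
  # Working from the last match to the first keeps the earlier
  # positions valid even though each replacement changes the length.
  for (j in rev(seq_len(nrow(loc)))) {
    hex <- col2hex(str_sub(string, loc[j, "start"], loc[j, "end"]))
    str_sub(string, loc[j, "start"], loc[j, "end"]) <- hex
  }
  string
})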
Figure 2: A more complex situation involving iteration through a string and processing matches with a function.
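The second strategy mentioned above, iterating through the vectors simultaneously with mapply, might look like the following sketch. To keep it short it uses a three-colour stand-in pattern rather than the full colours() pattern, and it only collects the matches instead of replacing them. Both simplifications are mine.

```r
library(stringr)

strings <- c("Roses are red, violets are blue", "My favourite colour is green")
pattern <- "\\b(red|green|blue)\\b"  # small stand-in for the full colour pattern

locs <- str_locate_all(strings, pattern)

# mapply() passes the i-th string and the i-th location matrix
# to the function together, so no explicit index is needed
found <- mapply(function(string, loc) {
  matches <- str_sub(string, loc[, "start"], loc[, "end"])
  str_c(matches, collapse = ", ")
}, strings, locs, USE.NAMES = FALSE)
```

Compared with the index-based approach, there is no `i` to keep track of, but the function has to be written in terms of its paired arguments, which can be harder to read when more than two lists are involved.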

4 Conclusion

stringr provides an opinionated interface to strings in R. It makes string processing simpler by removing uncommon options, and by vigorously enforcing consistency across functions. I have also added new functions that I have found useful from Ruby, and over time, I hope users will suggest useful functions from other programming languages. I will continue to build on the included test suite to ensure that the package behaves as expected and remains bug-free.


CRAN packages used

stringr


References

J. E. Friedl. Mastering Regular Expressions. O’Reilly, 1997. URL http://oreilly.com/catalog/9781565922570.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Wickham, "stringr: modern, consistent string processing", The R Journal, 2010

BibTeX citation

@article{RJ-2010-012,
  author = {Wickham, Hadley},
  title = {stringr: modern, consistent string processing},
  journal = {The R Journal},
  year = {2010},
  note = {https://rjournal.github.io/},
  volume = {2},
  issue = {2},
  issn = {2073-4859},
  pages = {38-40}
}