User:MPopov (WMF)/Notes/R 4.x vs 3.x

From Meta, a Wikimedia project coordination wiki

These notes are mainly for people deciding which version of R to install on their own laptops. On the analytics cluster there is no choice to make, because the version of R is tied to each host's version of Debian:

  • stat1004, stat1006, and stat1007 have Debian Stretch, so they're stuck with R 3.3[1]
  • stat1005 and stat1008 have Debian Buster, so they're more up-to-date with R 3.5

So if you're managing your own computing environment and are trying to figure out whether to use 4.0 or 3.6 (the most recent pre-4.0 version), hopefully this will help.
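If you're not sure which version a given machine or session is running, a quick check like the following can help. This is only a minimal sketch; the variable names `r_version` and `is_r4` are made up for illustration.

```r
# Report the running R version and branch on it.
r_version <- getRversion()
cat("Running", R.version.string, "\n")

# getRversion() returns a version object, so comparing it with a string
# compares version components (4.0.0 vs 3.6.3), not characters.
is_r4 <- r_version >= "4.0.0"

if (is_r4) {
  message("This is R 4.x or later")
} else {
  message("This is R 3.x or earlier")
}
```

The same `getRversion() >= "4.0.0"` test can guard code in a script that has to run both on a laptop and on the cluster hosts.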

Changes

The most user-visible changes in R 4.0.0, quoted from https://cran.r-project.org/doc/manuals/r-devel/NEWS.html:

  • R now uses a stringsAsFactors = FALSE default, and hence by default no longer converts strings to factors in calls to data.frame() and read.table().
  • There is a new syntax for specifying raw character constants similar to the one used in C++: r"(...)" with ... any character sequence not containing the sequence )". This makes it easier to write strings that contain backslashes or both single and double quotes. For more details see ?Quotes.
  • The palette() function has a new default set of colours (which are less saturated and have better accessibility properties). There are also some new built-in palettes, which are listed by the new palette.pals() function.
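The palette change is the easiest of the three to poke at interactively. The sketch below assumes R 4.0.0 or later for the `palette.pals()` call, so it is wrapped in a version check; the variable names are illustrative only.

```r
# Compare the new default palette with the pre-4.0 one.
if (getRversion() >= "4.0.0") {
  # Names of the built-in palettes added in 4.0
  print(palette.pals())

  palette("default")        # make sure we start from the 4.x default
  new_default <- palette()

  palette("R3")             # "R3" is the built-in name for the old default
  old_default <- palette()

  palette("default")        # restore the 4.x default

  print(new_default)
  print(old_default)
}
```

If older plots need to keep their exact colours after an upgrade, calling `palette("R3")` at the top of the script is one way to get the previous behaviour back.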

There are other changes, of course, but for the most part they're either:

  • performance improvements (especially with how R manages memory)
  • nitty-gritty changes that affect package authors more than non-package-authoring users

stringsAsFactors

The first point is easy to grasp. Prior to 4.0, reading files in with read.csv() and converting objects to data.frames would, by default, convert strings to factors. As Tim Smith writes in aRrgh: A newcomer's (angry) guide to R:[2]

String values are, by default, treated as factors, not as character atomic vectors. If the strings in your data file describe elements of a set of discrete possibilities, this is often convenient and desirable. If you are not expecting it, this behavior may threaten to ruin your career. Pass stringsAsFactors=FALSE to leave your character vectors alone.

In 4.0, that's no longer needed. Of course, if you've switched your workflow to the tidyverse and make extensive use of {readr}'s read_csv and {tibble}'s as_tibble functions, you haven't had to worry about stringsAsFactors anyway, since neither converts strings to factors.
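For code that has to behave the same on both sides of the change, one option is to keep passing stringsAsFactors explicitly. The sketch below uses a made-up inline CSV (`csv_text`) so it runs without any files; the column names are placeholders.

```r
# A tiny made-up dataset, read from a string instead of a file.
csv_text <- "wiki,views\nenwiki,10\ndewiki,7"

# Explicit argument: same result on R 3.x and R 4.x.
views_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
views_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(views_chr$wiki)  # "character"
class(views_fct$wiki)  # "factor"

# Relying on the default is where the versions differ:
# "factor" before 4.0, "character" from 4.0 on.
views_default <- read.csv(text = csv_text)
class(views_default$wiki)
```

Being explicit costs a few keystrokes but means a script gives the same column types on a laptop with R 4.x and on a cluster host with R 3.x.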

Writing strings

One of the biggest frustrations when working with strings in R has been backslashes. In a regular string literal, R treats a backslash as the start of an escape sequence, and in R 3.6 and earlier that was the only kind of string literal available:

# R 3.3.3
> (x <- c("test\string", "test string"))
Error: '\s' is an unrecognized escape in character string starting ""test\s"

If you wanted to have a string like "test\string", you would need to escape the backslash:

# R 3.3.3
> (x <- c("test\\string", "test string"))
[1] "test\\string" "test string"
# R 3.3.3
> cat(x, sep="\n")
test\string
test string

Now, if we wanted to do any pattern matching, it'd get…rather tricky:

# R 3.3.3
> grepl("\s", x)
Error: '\s' is an unrecognized escape in character string starting ""\s"

When we escape the backslash, the regex engine receives \s (the whitespace character class in regular expressions):

# R 3.3.3
> grepl("\\s", x)
[1] FALSE  TRUE

So the second string (the one with a space in it) is a match for it. If we wanted to look for the actual, literal "\s", we would need to escape the backslash twice, once for R's string parser and once for the regex engine:

# R 3.3.3
> grepl("\\\\s", x)
[1]  TRUE FALSE

As you can see, "\\\\s" matches "\s". Not great, right? Hence the new raw string syntax in 4.0:

# R 4.0.0
> (x <- c(r"(test\string)", "test string"))
[1] "test\\string" "test string"

> cat(x, sep = "\n")
test\string
test string

And matching is easier too:

# R 4.0.0
> grepl(r"(\s)", x)
[1] FALSE  TRUE

> grepl(r"(\\s)", x)
[1]  TRUE FALSE

Doesn't that look nicer?
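To convince yourself that a raw string is just another way of typing the same characters, you can compare it against its escaped equivalent. This is a small sketch that needs R 4.0.0 or later to parse; the pattern and the `versions` vector are invented for the example. The last two lines show `fixed = TRUE`, an existing option that also works on 3.x when all you want is a literal match.

```r
# Requires R >= 4.0.0: older versions stop with a parse error on r"(...)".
pattern_raw     <- r"(^R \d+\.\d+$)"
pattern_escaped <- "^R \\d+\\.\\d+$"

# Both spellings produce exactly the same string.
identical(pattern_raw, pattern_escaped)  # TRUE

versions <- c("R 3.6", "R 4.0", "R four")
grepl(pattern_raw, versions)             # TRUE TRUE FALSE

# Literal matching without regex escaping, available on R 3.x too:
x <- c("test\\string", "test string")
grepl("\\s", x, fixed = TRUE)            # TRUE FALSE
```

With `fixed = TRUE` the pattern is not treated as a regular expression, so only the string parser's level of escaping remains.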

Quotes

Another benefit is mixing double and single quotes. In R 3.6 and earlier, you would need to escape whichever quotes you used to enclose the string. For example:

# R 3.3.3
> ""test" and 'test'"
Error: unexpected symbol in """test"

> "\"test\" and 'test'"
[1] "\"test\" and 'test'"

But the new r"(...)" system lets you do the following:

# R 4.0.0
> r"("test" and 'test')"
[1] "\"test\" and 'test'"

SO. NICE.
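The one sequence a raw string cannot contain is its own closing delimiter, which for r"(...)" is )". Per ?Quotes, you can get around that by adding dashes between the quote and the parenthesis, or by using [] or {} as the delimiter pair. A short sketch (R 4.0.0 or later; the strings themselves are arbitrary examples):

```r
# Requires R >= 4.0.0.
mixed_raw     <- r"("test" and 'test')"
mixed_escaped <- "\"test\" and 'test'"
identical(mixed_raw, mixed_escaped)      # TRUE

# This text contains )" in the middle, so plain r"(...)" would end too early.
# Dashes change the closing sequence to )-" instead.
code_dashes  <- r"-(paste("a)", "b"))-"

# Square brackets work too: the closing sequence is then ]"
code_bracket <- r"[paste("a)", "b")]"

code_escaped <- "paste(\"a)\", \"b\")"
identical(code_dashes, code_escaped)     # TRUE
identical(code_bracket, code_escaped)    # TRUE

cat(code_dashes, "\n")
```

Which delimiter to pick is a matter of taste; the only requirement is that the closing sequence does not appear inside the text.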

Conclusion

The string stuff is a major quality-of-life improvement, and if that totally sold you on 4.x, go for it. For now, though, I would recommend sticking with 3.x, because not every package may be compatible with the new version yet. As I mentioned earlier, there are a lot of lower-level changes that affect package authors. In some cases packages are totally fine as-is, but in others they need to be updated by their maintainers. If you're not itching for that new r"(...)" feature and you're unsure about the compatibility of some packages you rely on, then you may want to wait six months to a year.
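If you do make the jump, note that the R 4.0.0 release notes say packages need to be (re-)installed under the new version. The sketch below is one way to see which installed packages were built under a pre-4.0 R; it only inspects the local library, and the variable names are illustrative.

```r
# List installed packages whose "Built" field names a pre-4.0 R.
installed <- installed.packages()[, c("Package", "Built"), drop = FALSE]

# "Built" holds the R version each package was built under, e.g. "3.6.3".
built_under <- package_version(installed[, "Built"])

needs_reinstall <- unname(installed[built_under < "4.0.0", "Package"])
print(needs_reinstall)

# Reinstalling is left as a comment because it needs network access:
# update.packages(checkBuilt = TRUE, ask = FALSE)
```

This says nothing about whether a package's code is compatible with 4.x; it only flags installations that were built under an older R and have not been rebuilt yet.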

References