I just wrote an entry on my Biochem blog which I think would fit on this site:
Default preferences
I enjoy using R and RStudio, but I am always wary of upgrading R because that usually leads to some issue(s). The most recent one took me a while to diagnose, even though in retrospect it is a simple change in a default preference: the value returned by default.stringsAsFactors() changed from TRUE in R versions 3.x to FALSE in the newest R versions 4.0.x. This setting is part of the functions used to read data into R from tabular text, for example read.table() for tab-delimited text or read.csv() for comma-delimited text files.
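To make the difference concrete, here is a minimal sketch that reads the same small table twice, once with each setting. The inline data and the object names (csv_text, as_chr, as_fct) are made up for illustration and are not from the tutorial.

```r
# Hypothetical data standing in for a small comma-delimited file.
csv_text <- "genotype,OD_change
wt,0.52
mut,0.31
wt,0.49
mut,0.28"

# Passing the argument explicitly gives the same result in R 3.x and R 4.x.
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$genotype)  # "character"
class(as_fct$genotype)  # "factor"
```

Only the text column is affected; the numeric OD_change column is read the same way in both cases.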
One would hope that a script with such basic commands would continue to work after upgrading to a new version, as reading data should not change the nature of things without warning! However, this little change caused a simple command to stop working in one of the R tutorials, “Demo yeast mutant analysis.”
The commands were simple: read a tab-delimited text file (yeast_example.txt) and then make a plot (which should automatically be drawn as a boxplot).
yeast_eg = read.table('yeast_example.txt', header = T)
with(yeast_eg, plot(genotype, OD_change))
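However, with a default installation of R 4.0.x the plot command now fails: the genotype column is read as character strings rather than as a factor, so plot() no longer draws a boxplot.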
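One quick way to see that the column type is the culprit is to convert it at plotting time. The sketch below uses a small made-up data frame in place of yeast_example.txt (the values are hypothetical) and wraps genotype in factor() inside the plot call.

```r
# Hypothetical measurements; the real tutorial reads them from yeast_example.txt.
yeast_eg <- data.frame(
  genotype  = c("wt", "wt", "mut", "mut"),
  OD_change = c(0.52, 0.49, 0.31, 0.28),
  stringsAsFactors = FALSE   # mimic how R 4.x reads the text column
)

pdf(NULL)  # draw to a null device so the sketch also runs non-interactively

# Converting genotype on the fly lets plot() draw a boxplot again.
with(yeast_eg, plot(factor(genotype), OD_change))

invisible(dev.off())
```

In an interactive session the pdf(NULL) and dev.off() lines can be dropped so that the boxplot appears in the usual plot window.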
Solution 1
When things did not work I was puzzled, and since I had not used R for a while I simply searched online and found a suitable command that requires loading a library. The purpose of the command is to convert the character/words/string entries within a column of tabular data into a factor.
library(dplyr)
yeast_eg <- mutate_if(yeast_eg, is.character, as.factor)
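For readers who prefer not to load a library, the same conversion can be sketched in base R. This is one possible equivalent, not the command I found online; the data frame is a made-up stand-in for the tutorial file and is_chr is just an illustrative name.

```r
# Hypothetical stand-in for the data read from yeast_example.txt.
yeast_eg <- data.frame(
  genotype  = c("wt", "wt", "mut", "mut"),
  OD_change = c(0.52, 0.49, 0.31, 0.28),
  stringsAsFactors = FALSE
)

# Find the character columns, then convert only those to factors.
is_chr <- vapply(yeast_eg, is.character, logical(1))
yeast_eg[is_chr] <- lapply(yeast_eg[is_chr], as.factor)

str(yeast_eg)
```

Numeric columns such as OD_change are left untouched, and yeast_eg remains a data frame.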
This library is part of the great new ways of using R thanks to the Tidyverse but it adds a level of complexity that was not warranted before the upgrade of R to version 4.x.
Solution 2
I happened to notice default.stringsAsFactors() within the help file for read.table() and that led me to discover the change from R 3.x (default of TRUE) to R 4.0.x (default of FALSE). The argument within the function definition reads stringsAsFactors = default.stringsAsFactors(). One way to override the default is to explicitly set this argument to TRUE when calling read.table(), for example:
yeast_eg = read.table('yeast_example.txt', header = T,
stringsAsFactors = T
)
However, stringsAsFactors = T would have to be repeated each time. But there is a way to change the behavior by changing the default with the command
options(stringsAsFactors = TRUE)
as discussed (pros and cons) in this Stack Overflow article “Change stringsAsFactors settings for data.frame”.
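A middle road between repeating the argument and touching a global option is a small wrapper function. The sketch below is only one way to do it, and read_table_factors is a hypothetical name, not part of R.

```r
# Hypothetical helper: read.table() with string-to-factor conversion switched on.
# All other arguments (file, header, sep, ...) are passed through unchanged.
read_table_factors <- function(...) {
  read.table(..., stringsAsFactors = TRUE)
}

# Usage with inline tab-delimited text standing in for yeast_example.txt.
tab_text <- "genotype\tOD_change\nwt\t0.52\nmut\t0.31\nwt\t0.49"
yeast_eg <- read_table_factors(text = tab_text, header = TRUE)

class(yeast_eg$genotype)  # "factor"
```

Because the wrapper always supplies stringsAsFactors itself, passing that argument again in the call would raise an error. The setting stays local to scripts that use the helper, so other code keeps the R 4.x default.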
Prior announcement
Now that I know this I found this article: stringsAsFactors (2020/02/16)
The article provides a historical retrospective of the reasons why this change was not present, then added, and now again removed from R with good reasons, stating that “Automatic string to factor conversion introduces non-reproducibility.[…] Hence, the results of subsequent statistical analyses can differ with automatic string-to-factor conversion in place.”
Defaults
Computer software has the great flexibility of providing (convenient, or sometimes annoying) defaults, which users can arrange as they wish. However, changed settings may make one’s own installation behave very differently from that of others, and therefore create a loss of consistency. This experienced software engineer explains it better in his post “The pros and cons of “defaults””
Variable types
What are factors in R?
The Berkeley statistics department provides this answer as part of a longer article: “Conceptually, factors are variables in R which take on a limited number of different values; such variables are often referred to as categorical variables.”
The definitions of numerical and categorical variables, and the differences between them, are well summarized on this short page about variable types on the S.O.S. (Statistics Online Support, University of Texas).
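A short sketch may help make that definition tangible. The genotype labels below are made up for illustration; the functions are all base R.

```r
# A factor built from a character vector with two distinct values.
genotype <- factor(c("wt", "mut", "wt", "wt"))

levels(genotype)      # the limited set of values: "mut" "wt"
nlevels(genotype)     # 2
as.integer(genotype)  # underlying integer codes: 2 1 2 2
table(genotype)       # number of observations per category
```

Internally each entry is stored as an integer code pointing into the sorted set of levels, which is what lets functions such as plot() treat the column as categories rather than free text.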