The flexibility of R and the diversity of the R community lead to a large number of programming styles applied in R packages. We have analyzed 108 million lines of R code from CRAN and quantified the evolution in popularity of 12 style-elements from 1998 to 2019. We identify 3 main factors that drive changes in programming style: the effect of style-guides, the effect of introducing new features, and the effect of editors. We observe in the data that a consensus in programming style is forming, such as using lower snake case for function names (e.g. softplus_func) and <- rather than = for assignment.
R is flexible. For example, one can use <- or = as assignment operators. The following two functions can both be correctly evaluated.
sum_of_square <- function(x) {
  return(sum(x^2))
}
sum_oF.square=function(x)
{
  sum(x ^ 2)}
One area that can highlight this flexibility is naming conventions. According to the previous research by Bååth (2012), there are at least 6 styles and none of the 6 has dominated the scene. Beyond the naming conventions investigated by Bååth (2012), there are other style-elements that R programmers have the freedom to adopt, e.g. whether or not to add spaces around infix operators, or whether to use double or single quotation marks to denote strings. On one hand, these variations provide programmers with freedom. On the other hand, these variations can confuse new programmers and can have dire effects on program comprehension. Incompatibility between programming styles might also affect reusability, maintainability (Elish and Offutt 2002), and open source collaboration (Wang and Hahn 2017).
Various efforts to standardize the programming style, e.g. Google’s R Style Guide (Google 2019), the Tidyverse Style Guide (Wickham 2017), and Bioconductor Coding Style (Bioconductor 2015), are available (Table 1).
Among the 3 style-guides, the major differences are the suggested naming convention and indentation, as highlighted in Table 1. Other style-elements are essentially the same. These style guides are based on possible improvement in code quality, e.g. style-elements that improve program comprehension (Oman and Cook 1991). However, we argue that one should first study the current situation, and preferably, the historical development, of programming style variations (PSV) to supplement these standardization efforts. We have undertaken such a task, so that the larger R communities can have a baseline to evaluate the effectiveness of those standardization efforts. Also, we can have a better understanding of the factors driving increase and decrease in PSV historically, such that more effective standardization efforts can be formulated.
Feature | Google | Tidyverse | Bioconductor |
---|---|---|---|
Function name | UpperCamel | snake_case | lowerCamel |
Assignment | Discourage = | Discourage = | Discourage = |
Line length | “limit your code to 80 characters per line” | “limit your code to 80 characters per line” | |
Space after a comma | Yes | Yes | Yes |
Space around infix operators | Yes | Yes | Yes |
Indentation | 2 spaces | 2 spaces | 4 spaces |
Integer | Not specified | Not specified (Integers are not explicitly typed in included code examples) | Not specified |
Quotes | Double | Double | Not specified |
Boolean values | Use TRUE / FALSE | Use TRUE / FALSE | Not specified |
Terminate a line with a semicolon | No | No | Not specified |
Curly braces | { same line, then a newline, } on its own line | { same line, then a newline, } on its own line | Not specified |
On July 1, 2020, we cloned a local mirror of CRAN using the rsync method suggested in the CRAN Mirror HOWTO (CRAN 2019). In order to facilitate the analysis, we have developed the package baaugwo (Chan 2020) to extract all R source code and metadata from these tarballs. In this study, only the source code from the /R directory of each tarball file is included. We have also archived the metadata from the DESCRIPTION and NAMESPACE files from the tarballs.
In order to cancel out the over-representation effect of multiple submissions in a year by a particular package, we have applied the "one-submission-per-year" rule to randomly select only one submission per year for each package. Unless otherwise noted, we present below the analysis of this "one-submission-per-year" sample. Similarly, unless otherwise noted, the unit of analysis is the exported function. The study period is from 1998 to 2019.
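The "one-submission-per-year" rule can be sketched in base R as follows. This is only an illustration under stated assumptions: the data frame `submissions`, its column names, and the helper `sample_one_per_year()` are hypothetical and are not taken from the study's scripts.

```r
# Hypothetical submission history: one row per CRAN submission.
submissions <- data.frame(
  package = c("pkgA", "pkgA", "pkgA", "pkgB", "pkgB"),
  version = c("0.1", "0.2", "1.0", "0.1", "0.2"),
  year    = c(2018, 2018, 2019, 2019, 2019),
  stringsAsFactors = FALSE
)

# Randomly keep exactly one row for each package-year combination.
sample_one_per_year <- function(df, seed = 42) {
  set.seed(seed)  # fixed seed so the sample is reproducible
  # Row indices grouped by package and year; empty combinations are dropped.
  groups <- split(seq_len(nrow(df)), list(df$package, df$year), drop = TRUE)
  # Pick one row index at random from each group.
  keep <- vapply(
    groups,
    function(idx) idx[sample.int(length(idx), 1)],
    integer(1)
  )
  df[sort(unname(keep)), , drop = FALSE]
}

one_per_year <- sample_one_per_year(submissions)
```

Indexing with `sample.int(length(idx), 1)` avoids the pitfall of `sample(idx, 1)`, which samples from `1:idx` when a group contains a single row.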
All exported functions in our sample were parsed into parse trees using the parser from the lintr (Hester and Angly 2019) package.
These parse trees were then filtered for lines with function definitions and linted using the linters from the lintr package to detect various style-elements. Style-elements considered in this study are:
Use = as assignment operators
softplusFunc = function(value, leaky = FALSE) {
  if (leaky) {
    warnings("using leaky RELU!")
    return(ifelse(value > 0L, value, value * 0.01))
  }
  return(log(1L + exp(value)))
}
An open curly is on its own line
softplusFunc <- function(value, leaky = FALSE)
{
  if (leaky)
  {
    warnings("using leaky RELU!")
    return(ifelse(value > 0L, value, value * 0.01))
  }
  return(log(1L + exp(value)))
}
No spaces are added around infix operators.
softplusFunc<-function(value, leaky=FALSE) {
  if (leaky) {
    warnings("using leaky RELU!")
    return(ifelse(value>0L, value, value*0.01))
  }
  return(log(1L+exp(value)))
}
Integers are not explicitly typed
softplusFunc <- function(value, leaky = FALSE) {
  if (leaky) {
    warnings("using leaky RELU!")
    return(ifelse(value > 0, value, value * 0.01))
  }
  return(log(1 + exp(value)))
}
Use single quotation marks for strings
softplusFunc <- function(value, leaky = FALSE) {
  if (leaky) {
    warnings('using leaky RELU!')
    return(ifelse(value > 0L, value, value * 0.01))
  }
  return(log(1L + exp(value)))
}
No space is added after commas
softplusFunc <- function(value,leaky = FALSE) {
  if (leaky) {
    warnings("using leaky RELU!")
    return(ifelse(value > 0L,value,value * 0.01))
  }
  return(log(1L + exp(value)))
}
Use semicolons to terminate lines
softplusFunc <- function(value, leaky = FALSE) {
  if (leaky) {
    warnings("using leaky RELU!");
    return(ifelse(value > 0L, value, value * 0.01));
  }
  return(log(1L + exp(value)));
}
Use T/F instead of TRUE / FALSE
softplusFunc <- function(value, leaky = F) {
  if (leaky) {
    warnings("using leaky RELU!")
    return(ifelse(value > 0L, value, value * 0.01))
  }
  return(log(1L + exp(value)))
}
A close curly is not on its own line.
softplusFunc <- function(value, leaky = FALSE) {
  if (leaky) {
    warnings("using leaky RELU!")
    return(ifelse(value > 0L, value, value * 0.01)) }
  return(log(1L + exp(value))) }
Use tab to indent
softplusFunc <- function(value, leaky = FALSE) {
	if (leaky) {
		warnings("using leaky RELU!")
		return(ifelse(value > 0L, value, value * 0.01))
	}
	return(log(1L + exp(value)))
}
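Several of the style-elements listed above can be detected directly from R's token stream. The following is a minimal sketch that uses base R's `parse()` and `utils::getParseData()` instead of the lintr linters, so it is a simplified stand-in and not the pipeline used in the study. The function name `detect_style()` and the example code string are hypothetical.

```r
# Count occurrences of four style-elements in a piece of R code.
# Assumption: token-level detection with base R, not lintr's linters.
detect_style <- function(code) {
  parsed <- parse(text = code, keep.source = TRUE)
  tokens <- utils::getParseData(parsed)
  # String constants keep their original quotation marks in `text`.
  strings <- tokens$text[tokens$token == "STR_CONST"]
  c(
    # `=` used for assignment (argument defaults use other tokens)
    assign_eq    = sum(tokens$token == "EQ_ASSIGN"),
    # T / F used as symbols instead of TRUE / FALSE
    t_f          = sum(tokens$token == "SYMBOL" & tokens$text %in% c("T", "F")),
    # strings written with single quotation marks
    single_quote = sum(startsWith(strings, "'")),
    # semicolons used to terminate expressions
    semicolon    = sum(tokens$token == "';'")
  )
}

code <- paste(
  "softplusFunc = function(value, leaky = F) {",
  "  msg <- 'using leaky RELU!';",
  "  log(1 + exp(value))",
  "}",
  sep = "\n"
)
style_counts <- detect_style(code)
```

Working on tokens rather than on raw text keeps the check from being fooled by, for example, a `=` that appears inside a string or as an argument default.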
We have also studied the naming conventions of all included functions. Using a technique similar to that of Bååth (2012), we classified function names into 7 categories.
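A classification of this kind can be sketched with regular expressions. The category labels and the patterns below are assumptions made for illustration (only lower_snake, lowerCamel, and UpperCamel are named elsewhere in this paper), and `classify_name()` is a hypothetical helper, not the classifier used in the study.

```r
# Assign a function name to one of seven assumed naming categories.
# The order of the checks matters: the first matching pattern wins.
classify_name <- function(name) {
  if (grepl("^[a-z0-9]+$", name)) return("alllower")
  if (grepl("^[A-Z0-9]+$", name)) return("ALLUPPER")
  if (grepl("^[a-z0-9]+(_[a-z0-9]+)+$", name)) return("lower_snake")
  if (grepl("^[a-z0-9]+(\\.[a-z0-9]+)+$", name)) return("dotted.func")
  if (grepl("^[a-z][a-zA-Z0-9]*$", name)) return("lowerCamel")
  if (grepl("^[A-Z][a-zA-Z0-9]*$", name)) return("UpperCamel")
  "other"
}

function_names <- c(
  "softplus", "SOFTPLUS", "softplus_func", "softplus.func",
  "softplusFunc", "SoftplusFunc", "sum_oF.square"
)
categories <- vapply(function_names, classify_name, character(1))
```

Names that mix conventions, such as `sum_oF.square` from the introduction, fall through every pattern and end up in the catch-all category.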
The last style-element is line-length. For each R file, we counted the distribution of line-length. In this analysis, the unit of analysis is line.
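Counting line lengths needs nothing beyond base R. The sketch below works on an in-memory character vector so that it is self-contained; for a real file the lines would come from `readLines()`. The helper name `line_length_distribution()` and the split into comment and code lines by a leading `#` are assumptions for illustration.

```r
# Tabulate how many lines have each length (in characters).
line_length_distribution <- function(lines) {
  table(nchar(lines, type = "chars"))
}

# Hypothetical file content; in practice: lines <- readLines("R/file.R")
lines <- c(
  "# compute the sum of squares",
  "sum_of_square <- function(x) {",
  "  return(sum(x^2))",
  "}",
  ""
)

all_lines_dist <- line_length_distribution(lines)

# Separate comment lines from code lines before tabulating.
is_comment   <- grepl("^\\s*#", lines)
comment_dist <- line_length_distribution(lines[is_comment])
code_dist    <- line_length_distribution(lines[!is_comment])
```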
Excluding line-length, the remaining 10 binary style-elements and one multinomial style-element (with 7 categories) leave 7,168 possible combinations of PSVs that a programmer could employ (7 × 2^10 = 7,168).
On top of the overall patterns based on the analysis of all functions, the community-specific variations are also studied. In this part of the study, we ask the question: do local patterns of PSV exist in various programming communities? To this end, we constructed a dependency graph of CRAN packages by defining a package as a node and an import/suggest relationship as a directed edge. Communities in this dependency graph were extracted using the Walktrap Community Detection Algorithm (Pons and Latapy 2005) provided by the igraph package (Csardi and Nepusz 2006). The step parameter was set at 4 for this analysis. Notably, we analyzed the dependency graph as a snapshot, which is built based on the submission history of every package from 1998 to 2019.
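The community extraction step can be sketched with igraph as follows. The edge list is a made-up toy example and the object names are hypothetical; only the use of the Walktrap algorithm with a step parameter of 4 comes from the description above.

```r
library(igraph)

# Hypothetical dependency edges: package `from` imports or suggests `to`.
edges <- data.frame(
  from = c("pkgA", "pkgB", "pkgC", "pkgD", "pkgE", "pkgF"),
  to   = c("pkgB", "pkgC", "pkgA", "pkgE", "pkgF", "pkgD"),
  stringsAsFactors = FALSE
)

# Packages are nodes, import/suggest relationships are directed edges.
dep_graph <- graph_from_data_frame(edges, directed = TRUE)

# Walktrap community detection with random walks of length 4.
communities <- cluster_walktrap(dep_graph, steps = 4)

# Community sizes, largest first, and the community of each package.
community_sizes <- sort(sizes(communities), decreasing = TRUE)
member <- setNames(
  as.integer(membership(communities)),
  V(dep_graph)$name
)
```

Selecting the largest communities for further analysis then amounts to taking the first entries of `community_sizes`.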
By applying the Walktrap Community Detection on the 2019 data, we have identified 931 communities in this CRAN dependency graph. The purpose of this analysis is to show the PSV in different communities. We selected the largest 20 communities for further analysis. The choice of 20 is deemed enough to show these community-specific variations. These 20 identified communities cover 88% of the total 14,491 packages, which shows that the coverage of our analysis is comprehensive. Readers could explore other choices themselves using our openly shared data.
As discussed in Gillespie and Lovelace (2016), maintaining a consistent style in source code can enable efficient reading by multiple readers; it is even thought to be a quality of a successful R programmer. In addition to the community-level analysis, we extend our work to the package level, in which we investigate the consistency of different style elements within a package. In this analysis, we studied 11 style elements: fx_assign, fx_commas, fx_integer, fx_semi, fx_t_f, fx_closecurly, fx_infix, fx_opencurly, fx_singleq, fx_tab, and fx_name. In other words, 10 binary variables (the first 10) and 1 multinomial variable (fx_name) could be assigned to each function within a package.
We quantified the package-level consistency by computing the entropy for
each style element. Given a style element S of an R package
As the value of
Finally, we calculate the
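As an illustration of entropy-based consistency, the sketch below computes the Shannon entropy H(S) = -Σ p_i log(p_i) of one style element within a package, where p_i is the proportion of functions using choice i. Dividing by log(n), the maximum entropy for n possible choices, is an assumption made here so that binary and multinomial elements share a 0 to 1 scale. The helper `normalized_entropy()` and the example vectors are hypothetical.

```r
# Normalized Shannon entropy of one style element within a package.
# `styles`: the choice observed for each function in the package.
# `n_choices`: number of possible choices (2 for a binary element).
normalized_entropy <- function(styles, n_choices) {
  p <- as.numeric(table(styles)) / length(styles)
  p <- p[p > 0]              # drop unused choices to avoid 0 * log(0)
  h <- -sum(p * log(p))      # Shannon entropy
  h / log(n_choices)         # assumed normalization to the range [0, 1]
}

# A package in which every function uses <- is perfectly consistent.
consistent_pkg <- normalized_entropy(c(FALSE, FALSE, FALSE, FALSE), 2)

# A package split evenly between = and <- is maximally inconsistent.
mixed_pkg <- normalized_entropy(c(TRUE, FALSE, TRUE, FALSE), 2)
```

A value of 0 means every function in the package makes the same choice, and a value of 1 means the choices are spread evenly.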
We studied more than 108 million lines of code from 17,692 unique packages. In total, 2,249,326 exported functions were studied. Figure 1 displays the popularity of the 10 binary style-elements from 1998 to 2019. Some style-elements have very clear trends towards a majority-vs-minority pattern, e.g. fx_closecurly, fx_semi, fx_t_f and fx_tab. Other style-elements are instead trending away from a previous majority-vs-minority pattern, e.g. fx_assign, fx_commas, fx_infix, fx_integer, fx_opencurly and fx_singleq. Two style-elements deserve special scrutiny. Firstly, the variation in fx_assign is an illustrative example of the effect of introducing a new language feature by the R Development Core Team. The introduction of = as an assignment operator in R 1.4 (Chambers 2001) coincided with the rise in popularity of this style-element since 2001. At the end of the study period, around 20% of exported functions use this style.
Secondly, the popularity of fx_opencurly shows how a previously established majority style (around 80% in the late 1990s) slowly declined into a minority, but still prominent, style (around 30% in the late 2010s).
Similarly, the evolution of different naming conventions is shown in Figure 2.
The evolution of line lengths is tricky to visualize on a 2-D surface. We have prepared a Shiny app (https://github.com/chainsawriot/rstyle/tree/master/shiny) to visualize the change in line length distribution over the span of 21 years. In this paper, Figure 3 shows a snapshot of the change in line length distribution in the range of 40 to 100 characters. In general, developers of newer packages write fewer characters per line. Similar to previous analyses of Python programs (e.g. VanderPlas), artificial peaks corresponding to recommendations from style-guides, linters, and editor settings are also observed in our analysis. In 2019, the artificial peak at 80 characters (recommended by most of the style-guides and linters such as lintr) is more pronounced for lines with comments but not for those with actual code.
Using the aforementioned community detection algorithm of the dependency graph, the largest 20 communities were extracted. These communities are named by their applications. Table 2 lists the details of these communities.
Using the naming convention as an example, there are local patterns in PSV (Figure 4). For example, lower_snake is the most popular naming convention in the "RStudio" community, as expected, because it is the naming convention endorsed by the Tidyverse Style Guide. However, only a few functions exported by the packages from the "GUI: Gtk" community use this convention.
Community | Number of Packages | Top 3 Packages |
---|---|---|
base | 5157 | methods, stats, MASS |
RStudio | 4758 | testthat, knitr, rmarkdown |
Rcpp | 826 | Rcpp, tinytest, pinp |
Statistical Analysis | 463 | survival, Formula, sandwich |
Machine Learning | 447 | nnet, rpart, randomForest |
Geospatial | 367 | sp, rgdal, maptools |
GNU gsl | 131 | gsl, expint, mnormt |
Graph | 103 | graph, Rgraphviz, bnlearn |
Text Analysis | 79 | tm, SnowballC, NLP |
GUI: Tcl/Tk | 55 | tcltk, tkrplot, tcltk2 |
Infrastructure | 54 | rsp, listenv, globals |
Numerical Optimization | 51 | polynom, magic, numbers |
Genomics | 43 | Biostrings, IRanges, S4Vectors |
RUnit | 38 | RUnit, ADGofTest, fAsianOptions |
Survival Analysis | 33 | kinship2, CompQuadForm, coxme |
Sparse Matrix | 32 | slam, ROI, registry |
GUI: Gtk | 31 | RGtk2, gWidgetstcltk, gWidgetsRGtk2 |
Bioinformatics | 29 | limma, affy, marray |
IO | 28 | RJSONIO, Rook, base64 |
rJava | 27 | rJava, xlsxjars, openNLP |
For the binary style-elements, local patterns are also observed (Figure 5). The most salient pattern is the exceptionally high usage of tab indentation in the "rJava" and "Bioinformatics" communities, probably due to influences from Java or Perl. Also, packages in "GUI: Gtk" have an exceptionally high usage of an open curly on its own line.
The result shows that the consistency of style elements within a package varies (Figure 6). For example, style elements like fx_integer, fx_commas, fx_infix, fx_opencurly, and fx_name are less consistent within a package than fx_tab, fx_semi, fx_t_f, fx_closecurly, fx_singleq, and fx_assign. Based on our within-package analysis, we noticed that it is rare for a package to use a consistent style in all of its functions, except for packages with only a few functions. This finding lends support to previous concerns (e.g. Oman and Cook 1991; Elish and Offutt 2002; Gillespie and Lovelace 2016; Wang and Hahn 2017) that these inconsistent style variations within a software project (e.g. in an R package) might make open source collaboration difficult.
In Figure 7, we contextualize this finding by showing the distribution of fx_name in 20 R packages with the highest PageRank (Page et al. 1999) in the CRAN dependency graph. Many of these packages have only 1 dominant naming convention (e.g. lower_snake or lowerCamel), but not always. For instance, functions with 6 different naming conventions can be found in the package Rcpp.
In this study, we examine the PSV in 21 years of CRAN packages across two dimensions: 1) the temporal dimension: the longitudinal changes in popularity of various style-elements over 21 years, and 2) the cross-sectional dimension: the variations among communities in the latest snapshot of all packages from 1998 to 2019. From our analysis, we identify three factors that possibly drive PSV: the effect of style-guides (trending of naming conventions endorsed by Wickham (2017) and Google (2019)), the effect of introducing a new language feature (trending of = usage as assignments after 2001), and the effect of editors (the dominance of the 80-character line limit).
From a policy recommendation standpoint, our study provides important insight for the R Development Core Team and other stakeholders to improve the current situation of PSV in R. First, the introduction of a new language feature can have a very long-lasting effect on PSV. "Assignments with the = operator" is a feature that was introduced by the R Development Core Team to “increase compatibility with S-Plus (as well as with C, Java, and many other languages)” (Chambers 2001). This might have been well intentioned, but it had the unintended consequence of introducing a very persistent PSV that two major style-guides, Wickham (2017) and Google (2019), consider bad style.
Second, style-guides, linters, and editors are important standardizers of PSV. Although we have not directly measured the use of style-guides, linters, and editors in our analysis
Our analysis also raises an open question: should R adopt an official style-guide akin to PEP 8 of the Python Software Foundation (Van Rossum et al. 2001)? There are of course pros and cons of adopting an official style-guide. As written by Christiansen (1998), “style can easily become a religious issue.” It is not our intention to meddle in this “religious issue.” If such an effort were to be undertaken by someone else, the following consensus-based style could be used as the basis. The following is an example of a function written in such a style.
softplus_func <- function(value, leaky = FALSE) {
  if (leaky) {
    warnings("using leaky RELU!")
    return(ifelse(value > 0, value, value * 0.01))
  }
  return(log(1 + exp(value)))
}
In essence,
We must stress here that this consensus-based style is only the most popular style based on our analysis, i.e. the Zeitgeist (the spirit of the age)
The data and scripts to reproduce the analysis in this paper are available at https://github.com/chainsawriot/rstyle. An archived version is available at this DOI: http://doi.org/10.5281/zenodo.4026589.
We have presented a previous version of this paper as a poster at UseR! 2019 Toulouse. Interested readers can access it with the following link: https://github.com/chainsawriot/rstyle/blob/master/docs/Poster_useR2019_Toulouse.png. The work was done prior to Ms Chang joining Amazon Web Services.
The authors would like to thank Wush Wu, Liang-Bo Wang, Taiwan R User group, R-Ladies Taipei, attendees of UseR! 2019, and the two reviewers for their valuable comments that greatly improved this paper.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Yen, et al., "A Computational Analysis of the Dynamics of R Style Based on 108 Million Lines of Code from All CRAN Packages in the Past 21 Years", The R Journal, 2022
BibTeX citation
@article{RJ-2022-006, author = {Yen, Chia-Yi and Chang, Mia Huai-Wen and Chan, Chung-hong}, title = {A Computational Analysis of the Dynamics of R Style Based on 108 Million Lines of Code from All CRAN Packages in the Past 21 Years}, journal = {The R Journal}, year = {2022}, note = {https://rjournal.github.io/}, volume = {14}, issue = {1}, issn = {2073-4859}, pages = {6-21} }