A Computational Analysis of the Dynamics of R Style Based on 108 Million Lines of Code from All CRAN Packages in the Past 21 Years

The flexibility of R and the diversity of the R community lead to a large number of programming styles applied in R packages. We have analyzed 108 million lines of R code from CRAN and quantified the evolution in popularity of 12 style-elements from 1998 to 2019. We identify 3 main factors that drive changes in programming style: the effect of style-guides, the effect of introducing new features, and the effect of editors. We observe in the data that a consensus in programming style is forming, such as using lower snake case for function names (e.g. softplus_func) and <- rather than = for assignment.

Chia-Yi Yen (Mannheim Business School, Universität Mannheim) , Mia Huai-Wen Chang (Amazon Web Services) , Chung-hong Chan (Mannheimer Zentrum für Europäische Sozialforschung, Universität Mannheim)
2022-06-21

1 Introduction

R is flexible. For example, one can use <- or = as assignment operators. The following two functions can both be correctly evaluated.

sum_of_square <- function(x) {
    return(sum(x^2))
}
sum_oF.square=function(x)
{
    sum(x ^ 2)}

One area that highlights this flexibility is naming conventions. According to previous research by Bååth (2012), there are at least 6 styles and none of the 6 has dominated the scene. Beyond the naming conventions investigated by Bååth (2012), there are other style-elements that R programmers have the freedom to adopt, e.g. whether or not to add spaces around infix operators, and whether to use double or single quotation marks to denote strings. On one hand, these variations provide programmers with freedom. On the other hand, these variations can confuse new programmers and can have dire effects on program comprehension. Incompatibility between programming styles might also affect reusability, maintainability (Elish and Offutt 2002), and open source collaboration (Wang and Hahn 2017).

Various efforts to standardize the programming style are available (Table 1), e.g. Google’s R Style Guide (Google 2019), the Tidyverse Style Guide (Wickham 2017), and the Bioconductor Coding Style (Bioconductor 2015).

Among the 3 style-guides, the major differences are the suggested naming convention and indentation, as highlighted in Table 1. Other style-elements are essentially the same. These style guides are based on possible improvement in code quality, e.g. style-elements that improve program comprehension (Oman and Cook 1991). However, we argue that one should first study the current situation, and preferably, the historical development, of programming style variations (PSV) to supplement these standardization efforts. We have undertaken such a task, so that the larger R communities can have a baseline to evaluate the effectiveness of those standardization efforts. Also, we can have a better understanding of the factors driving increase and decrease in PSV historically, such that more effective standardization efforts can be formulated.

Table 1: Three major style-guides (Google, Tidyverse and Bioconductor) and their differentiating style-elements (function name and indentation)
Feature | Google | Tidyverse | Bioconductor
Function name | UpperCamel | snake_case | lowerCamel
Assignment | Discourage = | Discourage = | Discourage =
Line length | “limit your code to 80 characters per line” | “limit your code to 80 characters per line” | \(\leqslant\) 80
Space after a comma | Yes | Yes | Yes
Space around infix operators | Yes | Yes | Yes
Indentation | 2 spaces | 2 spaces | 4 spaces
Integer | Not specified | Not specified (Integers are not explicitly typed in included code examples) | Not specified
Quotes | Double | Double | Not specified
Boolean values | Use TRUE / FALSE | Use TRUE / FALSE | Not specified
Terminate a line with a semicolon | No | No | Not specified
Curly braces | { same line, then a newline, } on its own line | { same line, then a newline, } on its own line | Not specified

2 Analysis

Data Source

On July 1, 2020, we cloned a local mirror of CRAN using the rsync method suggested in the CRAN Mirror HOWTO (CRAN 2019). Our local mirror contains all contributed packages as tarball files (.tar.gz). By all contributed packages, we mean packages actively listed online on the CRAN website as well as orphaned and archived packages. In this analysis, we include all active, orphaned and archived packages.

In order to facilitate the analysis, we have developed the package baaugwo (Chan 2020) to extract all R source code and metadata from these tarballs. In this study, only the source code from the /R directory of each tarball file is included. We have also archived the metadata from the DESCRIPTION and NAMESPACE files from the tarballs.

In order to cancel out the over-representation effect of multiple submissions in a year by a particular package, we have applied the "one-submission-per-year" rule: for each package, we randomly selected only one submission from each year. Unless otherwise noted, we present below the analysis of this "one-submission-per-year" sample. Similarly, unless otherwise noted, the unit of analysis is the exported function. The study period is from 1998 to 2019.
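The "one-submission-per-year" rule can be sketched in base R as follows. This is only an illustration under assumptions: the data frame `submissions`, its columns, and the helper `sample_one_per_year()` are hypothetical and are not taken from the baaugwo package.

```r
# Hypothetical submission history: one row per CRAN submission.
submissions <- data.frame(
  package = c("pkgA", "pkgA", "pkgA", "pkgB", "pkgB"),
  version = c("0.1", "0.2", "1.0", "0.1", "0.2"),
  year    = c(2018, 2018, 2019, 2019, 2019),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep one randomly chosen submission
# for every (package, year) pair.
sample_one_per_year <- function(df) {
  # Group row indices by package and year; drop empty combinations.
  groups <- split(seq_len(nrow(df)), list(df$package, df$year), drop = TRUE)
  # Draw one index per group. Indexing with sample.int() avoids the
  # pitfall of sample(x) when x is a single number.
  picked <- vapply(
    groups,
    function(idx) idx[sample.int(length(idx), 1)],
    integer(1)
  )
  df[sort(unname(picked)), , drop = FALSE]
}

set.seed(42)  # only to make the sketch reproducible
one_per_year <- sample_one_per_year(submissions)
```

In this toy input, pkgA has two submissions in 2018 and pkgB has two in 2019, so the sample keeps three rows, one for each package and year.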

Quantification of PSV

All exported functions in our sample are parsed into a parse tree using the parser from the lintr (Hester and Angly 2019) package.

These parse trees were then filtered for lines with function definitions and linted using the linters from the lintr package to detect various style-elements. Style-elements considered in this study are:

fx_assign

Use = as the assignment operator

softplusFunc = function(value, leaky = FALSE) {
    if (leaky) {
        warning("using leaky RELU!")
        return(ifelse(value > 0L, value, value * 0.01))
    }
    return(log(1L + exp(value)))
}

fx_opencurly

An open curly is on its own line

softplusFunc <- function(value, leaky = FALSE) 
{
    if (leaky) 
    {
        warning("using leaky RELU!")
        return(ifelse(value > 0L, value, value * 0.01))
    }
    return(log(1L + exp(value)))
}

fx_infix

No spaces are added around infix operators.

softplusFunc<-function(value, leaky=FALSE) {
    if (leaky) {
        warning("using leaky RELU!")
        return(ifelse(value>0L, value, value*0.01))
    }
    return(log(1L+exp(value)))
}

fx_integer

Integers are not explicitly typed

softplusFunc <- function(value, leaky = FALSE) {
    if (leaky) {
        warning("using leaky RELU!")
        return(ifelse(value > 0, value, value * 0.01))
    }
    return(log(1 + exp(value)))
}

fx_singleq

Use single quotation marks for strings

softplusFunc <- function(value, leaky = FALSE) {
    if (leaky) {
        warning('using leaky RELU!')
        return(ifelse(value > 0L, value, value * 0.01))
    }
    return(log(1L + exp(value)))
}

fx_commas

No space is added after commas

softplusFunc <- function(value,leaky = FALSE) {
    if (leaky) {
        warning("using leaky RELU!")
        return(ifelse(value > 0L,value,value * 0.01))
    }
    return(log(1L + exp(value)))
}

fx_semi

Use semicolons to terminate lines

softplusFunc <- function(value, leaky = FALSE) {
    if (leaky) {
        warning("using leaky RELU!");
        return(ifelse(value > 0L, value, value * 0.01));
    }
    return(log(1L + exp(value)));
}

fx_t_f

Use T/F instead of TRUE / FALSE

softplusFunc <- function(value, leaky = F) {
    if (leaky) {
        warning("using leaky RELU!")
        return(ifelse(value > 0L, value, value * 0.01))
    }
    return(log(1L + exp(value)))
}

fx_closecurly

A close curly is not on its own line.

softplusFunc <- function(value, leaky = FALSE) {
    if (leaky) {
        warning("using leaky RELU!")
        return(ifelse(value > 0L, value, value * 0.01)) }
    return(log(1L + exp(value))) }

fx_tab

Use tab to indent

softplusFunc <- function(value, leaky = FALSE) {
	if (leaky) {
		warning("using leaky RELU!")
		return(ifelse(value > 0L, value, value * 0.01))
	}
	return(log(1L + exp(value)))
}
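To make the idea of detecting style-elements from parsed code concrete, the following sketch checks four of the binary style-elements listed above using only base R. It is not the lintr-based pipeline used in this study: the function `detect_style()` is a hypothetical helper, and it relies on the token table returned by `utils::getParseData()` as a rough stand-in for the linters.

```r
# Hypothetical helper: flag a few binary style-elements in a code string.
detect_style <- function(code) {
  # keep.source = TRUE is needed so that parse data is available
  # even in non-interactive sessions.
  exprs <- parse(text = code, keep.source = TRUE)
  pd <- utils::getParseData(exprs)
  pd <- pd[pd$terminal, ]  # keep only the leaf tokens
  c(
    # `=` used as an assignment operator (not for arguments or formals)
    fx_assign  = any(pd$token == "EQ_ASSIGN"),
    # a string constant written with single quotation marks
    fx_singleq = any(pd$token == "STR_CONST" & startsWith(pd$text, "'")),
    # a semicolon token anywhere in the code
    fx_semi    = any(pd$text == ";"),
    # the symbols T or F used in place of TRUE / FALSE
    fx_t_f     = any(pd$token == "SYMBOL" & pd$text %in% c("T", "F"))
  )
}

code_a <- 'softplusFunc = function(value, leaky = F) {\n  return(log(1L + exp(value)));\n}'
code_b <- "softplus_func <- function(value, leaky = FALSE) {\n  warning('leaky')\n  log(1L + exp(value))\n}"

res_a <- detect_style(code_a)
res_b <- detect_style(code_b)
```

Here `res_a` flags fx_assign, fx_semi and fx_t_f, while `res_b` flags only fx_singleq. Whitespace-based elements such as fx_infix, fx_commas and fx_tab would need the original source lines in addition to the tokens.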

We have also studied the naming conventions of all included functions. Using a technique similar to that of Bååth (2012), we classified function names into 7 categories, among them dotted.func, lower_snake, lowerCamel and UpperCamel.
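A regular-expression classifier in the spirit of Bååth (2012) might look like the sketch below. The patterns are our own simplification rather than the exact rules used in the study, and the labels `alllower`, `ALLUPPER` and `other` are hypothetical placeholders; only dotted.func, lower_snake, lowerCamel and UpperCamel are category names that appear in this paper.

```r
# Hypothetical helper: assign a function name to the first matching category.
classify_name <- function(name) {
  # Rules are checked in order; the patterns are illustrative assumptions.
  rules <- c(
    alllower    = "^[a-z0-9]+$",
    ALLUPPER    = "^[A-Z0-9]+$",
    lower_snake = "^[a-z0-9]+(_[a-z0-9]+)+$",
    dotted.func = "^[a-z0-9]+(\\.[a-z0-9]+)+$",
    lowerCamel  = "^[a-z][a-z0-9]*([A-Z][a-z0-9]*)+$",
    UpperCamel  = "^[A-Z][a-z0-9]+([A-Z][a-z0-9]*)*$"
  )
  hit <- names(rules)[vapply(rules, grepl, logical(1), x = name)]
  # Names that match no rule (e.g. mixed separators) fall into "other".
  if (length(hit) == 0) "other" else hit[1]
}

fx_names <- c("softplus_func", "softplusFunc", "SoftplusFunc",
              "softplus.func", "sum_oF.square")
name_style <- vapply(fx_names, classify_name, character(1))
```

With these rules, the mixed-style name `sum_oF.square` from the Introduction is classified as `other`.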

The last style-element is line-length. For each R file, we counted the distribution of line-length. In this analysis, the unit of analysis is line.
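Counting line lengths for one file can be sketched as below. The character vector `r_source` stands in for the result of `readLines()` on an R file, and treating a line as a comment when its first non-blank character is `#` is our assumption, not necessarily the rule used in the study.

```r
# Hypothetical content of one R file, one element per line.
r_source <- c(
  "# Compute the sum of squares of a numeric vector",
  "sum_of_square <- function(x) {",
  "    return(sum(x^2))",
  "}"
)

# Number of characters in each line.
line_length <- nchar(r_source, type = "chars")

# Assumption: a line is a comment if its first non-blank character is '#'.
is_comment <- grepl("^[[:space:]]*#", r_source)

# Distribution of line length, separately for comments and code.
length_by_type <- table(
  type   = ifelse(is_comment, "comment", "code"),
  length = line_length
)
```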

Leaving line-length aside, the remaining 10 binary style-elements and one multinomial style-element (naming convention, with 7 categories) yield 7,168 possible combinations of PSVs that a programmer could employ (\(7 \times 2^{10} = 7{,}168\)).

Methodology of community-specific variations analysis

On top of the overall patterns based on the analysis of all functions, the community-specific variations are also studied. In this part of the study, we ask the question: do local patterns of PSV exist in various programming communities? To this end, we constructed a dependency graph of CRAN packages by defining a package as a node and an import/suggest relationship as a directed edge. Communities in this dependency graph were extracted using the Walktrap Community Detection Algorithm (Pons and Latapy 2005) provided by the igraph package (Csardi and Nepusz 2006). The step parameter was set at 4 for this analysis. Notably, we analyzed the dependency graph as a snapshot, which is built based on the submission history of every package from 1998 to 2019.
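The community extraction described above can be sketched with igraph as follows. The edge list is a hypothetical toy example, not CRAN data, and the sketch is guarded so that it does nothing when igraph is not installed. As far as we are aware, the walktrap implementation in igraph does not take edge directions into account.

```r
# Hypothetical edge list: each row is one import/suggest relationship.
edges <- data.frame(
  from = c("pkgA", "pkgB", "pkgC", "pkgD", "pkgE", "pkgF", "pkgC"),
  to   = c("pkgB", "pkgC", "pkgA", "pkgE", "pkgF", "pkgD", "pkgD"),
  stringsAsFactors = FALSE
)

if (requireNamespace("igraph", quietly = TRUE)) {
  # Packages are nodes; import/suggest relationships are directed edges.
  g <- igraph::graph_from_data_frame(edges, directed = TRUE)

  # Walktrap community detection with the step parameter set to 4.
  comm <- igraph::cluster_walktrap(g, steps = 4)

  # Community label of every package, and community sizes in decreasing order.
  community_of <- igraph::membership(comm)
  community_sizes <- sort(igraph::sizes(comm), decreasing = TRUE)

  # PageRank of every package, usable for ranking packages within a community.
  pr <- igraph::page_rank(g)$vector
}
```

Selecting the largest communities then amounts to taking the first entries of `community_sizes`.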

By applying the Walktrap Community Detection on the 2019 data, we have identified 931 communities in this CRAN dependency graph. The purpose of this analysis is to show the PSV in different communities. We selected the largest 20 communities for further analysis. The choice of 20 is deemed enough to show these community-specific variations. These 20 identified communities cover 88% of the total 14,491 packages, which shows that the coverage of our analysis is comprehensive. Readers could explore other choices themselves using our openly shared data.

Methodology of within-package variations analysis

As discussed in Gillespie and Lovelace (2016), maintaining a consistent style in source code can enable efficient reading by multiple readers; it is even thought to be a quality of a successful R programmer. In addition to community-level analysis, we extend our work to the package-level, in which we investigate the consistency of different style elements within a package. In this analysis, we studied 11 style elements: fx_assign, fx_commas, fx_integer, fx_semi, fx_t_f, fx_closecurly, fx_infix, fx_opencurly, fx_singleq, fx_tab, and fx_name. In other words, 10 binary variables (the first 10) and 1 multinomial variable (fx_name) could be assigned to each function within a package.

We quantified the package-level consistency by computing the entropy for each style element. Given a style element S of an R package \(R_{i}\), with n possible choices \(s_{1}\), \(\dots\) \(s_{n}\) (e.g. n = 2 for binary; n = 7 for fx_name), the entropy \(H(S)\) is calculated as:

\[\label{eqn} H(S) = - \sum_{i = 1}^{n} P(s_{i}) \log P(s_{i}) \tag{1}\]

\(P(s_{i})\) is calculated as the proportion of all functions in \(R_{i}\) with the style element \(s_{i}\). For example, if a package has 4 functions and the S of these 4 functions are 0, 0, 1, 2, the entropy \(H(S)\), using the natural logarithm, is \(- ((0.5 \times \log 0.5) + (0.25 \times \log 0.25 ) + (0.25 \times \log 0.25)) = 1.04\).

As the value of \(H(S)\) is not comparable across different S with different numbers of choices n, we normalize the value of \(H(S)\) into \(H'(S)\) by dividing \(H(S)\) by its theoretical maximum, \(\log n\). The maximum values of \(H(S)\) for n = 2 and n = 7 are 0.693 and 1.946, respectively.

Finally, we calculate the \(\overline{H'(S)}\) of all CRAN packages (i.e. \(R_{1}\) \(\dots\) \(R_{N}\), where N equals the number of all CRAN packages) by averaging the \(H'(S)\) of the individual packages.
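The entropy computation and its normalization can be sketched in base R as follows. The helper names `style_entropy()` and `normalized_entropy()` and the toy packages are hypothetical. The natural logarithm is used, consistent with the maxima of 0.693 and 1.946 quoted above.

```r
# Entropy H(S) of one style element within one package.
# `styles` holds the observed choice for every function in the package.
style_entropy <- function(styles) {
  # Proportion of functions per observed choice; unobserved choices
  # contribute nothing, so log(0) never occurs.
  p <- as.numeric(table(styles)) / length(styles)
  -sum(p * log(p))
}

# Normalized entropy H'(S): divide by the theoretical maximum log(n).
normalized_entropy <- function(styles, n_choices) {
  style_entropy(styles) / log(n_choices)
}

# Worked example: 4 functions with choices 0, 0, 1, 2.
h <- style_entropy(c(0, 0, 1, 2))  # 1.5 * log(2), about 1.04

# Hypothetical packages with a binary style element (n = 2).
packages <- list(
  pkgA = c(0, 0, 0, 0),  # fully consistent
  pkgB = c(0, 1, 0, 1),  # evenly split
  pkgC = c(0, 0, 0, 1)
)
h_norm <- vapply(packages, normalized_entropy, numeric(1), n_choices = 2)

# Average normalized entropy across packages.
mean_h_norm <- mean(h_norm)
```

A fully consistent package has a normalized entropy of 0, and a package split evenly between the two choices has a normalized entropy of 1.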

3 Results

We studied more than 108 million lines of code from 17,692 unique packages. In total, 2,249,326 exported functions were studied. Figure 1 displays the popularity of the 10 binary style-elements from 1998 to 2019. Some style-elements have very clear trends towards a majority-vs-minority pattern, e.g. fx_closecurly, fx_semi, fx_t_f and fx_tab. Some style-elements are instead trending towards a divergence from a previous majority-vs-minority pattern, e.g. fx_assign, fx_commas, fx_infix, fx_integer, fx_opencurly and fx_singleq. There are two style-elements that deserve special scrutiny. Firstly, the variation in fx_assign is an illustrative example of the effect of introducing a new language feature by the R Development Core Team. The introduction of = as an assignment operator in R 1.4 (Chambers 2001) coincided with this style-element taking off in popularity from 2001 onwards. As of 2019, around 20% of exported functions use this style.

Figure 1: Evolution in popularity of 10 style-elements from 1998 to 2019.

Secondly, the popularity of fx_opencurly shows how a previously established majority style (around 80% in the late 1990s) slowly declined into a minority, but still prominent, style (around 30% in the late 2010s).

Similarly, the evolution of different naming conventions is shown in Figure 2. This analysis can best be used to illustrate the effect of style-guides. According to Bååth (2012), dotted.func style is very specific to R programming. This style was the dominant style in the early days of CRAN. However, multiple style guides advise against the use of dotted.func style and thus a significant declining trend is observed. lower_snake and UpperCamel are the styles endorsed by the Tidyverse Style Guide and Google’s R Style Guide, respectively. These two styles have seen an increasing trend since the 2010s, although the growth of lower_snake is stronger, with almost a 20% growth in the share of all functions in contrast with the 1-2% growth of other naming conventions. In 2019, lower_snake (a style endorsed by Tidyverse) is the most popular style (26.6%). lowerCamel case, a style endorsed by Bioconductor, is currently the second most popular naming convention (21.3% in 2019). Only 7.0% of functions use UpperCamel, the style endorsed by Google.

Figure 2: Evolution in popularity of 7 naming conventions from 1998 to 2019.

The evolution of line lengths is tricky to visualize on a 2-D surface. We have prepared a Shiny app (https://github.com/chainsawriot/rstyle/tree/master/shiny) to visualize the change in line length distribution over the span of 21 years. In this paper, Figure 3 shows a snapshot of the change in line length distribution in the range of 40 to 100 characters. In general, developers of newer packages write fewer characters per line. Similar to previous analyses with Python programs (e.g. VanderPlas), artificial peaks corresponding to recommendations from style-guides, linters, and editor settings are also observed in our analysis. In 2019, the artificial peak at 80 characters (recommended by most of the style-guides and linters such as lintr) is more pronounced for lines with comments than for those with actual code.

Figure 3: Change in line length distribution of comments (orange) and actual code (green): 2003, 2008, 2013 and 2019.

Community-based variations

Using the aforementioned community detection algorithm of the dependency graph, the largest 20 communities were extracted. These communities are named by their applications. Table 2 lists the details of these communities.

Using the naming convention as an example, there are local patterns in PSV (Figure 4). For example, lower_snake case is the most popular naming convention in the "RStudio" community, as expected, because it is the naming convention endorsed by the Tidyverse Style Guide. However, only a few functions exported by the packages from the "GUI: Gtk" community use this convention.

Table 2: The largest 20 communities and their top 3 packages according to PageRank
Community | Number of Packages | Top 3 Packages
base | 5157 | methods, stats, MASS
RStudio | 4758 | testthat, knitr, rmarkdown
Rcpp | 826 | Rcpp, tinytest, pinp
Statistical Analysis | 463 | survival, Formula, sandwich
Machine Learning | 447 | nnet, rpart, randomForest
Geospatial | 367 | sp, rgdal, maptools
GNU gsl | 131 | gsl, expint, mnormt
Graph | 103 | graph, Rgraphviz, bnlearn
Text Analysis | 79 | tm, SnowballC, NLP
GUI: Tcl/Tk | 55 | tcltk, tkrplot, tcltk2
Infrastructure | 54 | rsp, listenv, globals
Numerical Optimization | 51 | polynom, magic, numbers
Genomics | 43 | Biostrings, IRanges, S4Vectors
RUnit | 38 | RUnit, ADGofTest, fAsianOptions
Survival Analysis | 33 | kinship2, CompQuadForm, coxme
Sparse Matrix | 32 | slam, ROI, registry
GUI: Gtk | 31 | RGtk2, gWidgetstcltk, gWidgetsRGtk2
Bioinformatics | 29 | limma, affy, marray
IO | 28 | RJSONIO, Rook, base64
rJava | 27 | rJava, xlsxjars, openNLP
Figure 4: Community-specific distribution of naming conventions among 20 large communities.

For the binary style-elements, local patterns are also observed (Figure 5). The most salient pattern is the exceptionally high usage of tab indentation in the "rJava" and "Bioinformatics" communities, probably due to influences from Java or Perl. Also, packages in "GUI: Gtk" have an exceptionally high usage of an open curly on its own line.