Computational infrastructure for representing persons and citations has been available in R for several years, but has been restructured through enhanced classes "person"
and "bibentry"
in recent versions of R. The new features include support for the specification of the roles of package authors (e.g. maintainer, author, contributor, translator, etc.) and more flexible formatting/printing tools among various other improvements. Here, we introduce the new classes and their methods and indicate how this functionality is employed in the management of R packages. Specifically, we show how the authors of R packages can be specified along with their roles in package DESCRIPTION
and/or CITATION
files and the citations produced from it.
R packages are the result of scholarly activity and as such constitute scholarly resources which must be clearly identifiable for the respective scientific communities and, more generally, today’s information society. In particular, packages published by standard repositories can be regarded as reliable sources which can and should be referenced (i.e. cited) by scientific works such as articles or other packages. This requires conceptual frameworks and computational infrastructure for describing bibliographic resources, general enough to encompass the needs of communities with an interest in R. These needs include support for exporting bibliographic metadata in standardized formats such as BibTeX (Berry and Patashnik 2010), but also facilitating bibliometric analyses and investigations of the social fabric underlying the creation of scholarly knowledge.
The latter requires a richer vocabulary than commonly employed by reference management software such as BibTeX, identifying persons and their roles in relation to bibliographic resources. For example, a thesis typically has an author and advisors. Software can have an (original) author and a translator to another language (such as from S to R). The maintainer of an R package is not necessarily an author.
In this paper, we introduce the base R infrastructure (as completely
available in R since version 2.14.0) for representing and manipulating
such scholarly data: objects of class "person"
(hereafter, person
objects) hold information about persons, possibly including their roles;
objects of class "bibentry"
(hereafter, bibentry objects) hold
bibliographic information in enhanced BibTeX style, ideally using person
objects when referring to persons (such as authors or editors).
Furthermore, we indicate how this functionality is employed in the
management of R packages, in particular in their CITATION
and
DESCRIPTION
files.
Person objects can hold information about an arbitrary positive number
of persons. These can be obtained by one call to person()
with list
arguments, or by first creating objects representing single persons and
combining these via c()
.
Every person has at least one given name, and natural persons typically
(but not necessarily, Wikipedia 2012) have a family name. These
names are specified by arguments given
and family
, respectively. For
backward compatibility with older versions, there are also arguments
first
, middle
and last
with obvious (Western-centric) meaning;
however, these are deprecated as not universally applicable: some
cultures place the given after the family name, or use no family name.
An optional email
argument allows specifying persons’ email addresses,
which may serve as unique identifiers for the persons.
Whereas person names by nature may be arbitrary non-empty character
strings, the state of the art in library science is to employ
standardized vocabularies for person roles. R uses the “Relator and
Role” codes and terms (http://www.loc.gov/marc/relators/relaterm.html)
from MARC (MAchine-Readable Cataloging, Library of Congress 2012), one of the leading
collections of standards used by librarians to encode and share
bibliographic information. Argument role
specifies the roles using the
MARC codes; if necessary, the comment
argument can be used to provide
(human-readable) character strings complementing the standardized role
information.
As an example (see Figure 1 and also ?person
), consider
specifying the authors of package
boot (Canty and Ripley 2012): Angelo
Canty is the first author ("aut"
) who wrote the S original; Brian D.
Ripley is the second author ("aut"
), who translated the code to R
("trl"
), and maintains the package ("cre"
).
Note that the MARC definition of the relator term “Creator” is “Use for a person or organization responsible for the intellectual or artistic content of a work.”, which maps nicely to the notion of the maintainer of an R package: the maintainer creates the package as a bibliographic resource for possible redistribution (e.g. via the CRAN package repository), and is the person (currently) responsible for the package, being the “interface” between what is inside the package and the outside world, and warranting in particular that the license and representation of intellectual property rights are appropriate. For the persons who originally “created” the package code, typically author roles should be employed.
In general, while all MARC relator codes are supported, the following usage is suggested when giving the roles of persons in the context of authoring R packages:
"aut"
(Author):Full authors who have made substantial contributions to the package and should show up in the package citation.
"com"
(Compiler):Persons who collected code (potentially in other languages) but did not make further substantial contributions to the package.
"ctb"
(Contributor):Authors who have made smaller contributions (such as code patches etc.) but should not show up in the package citation.
"cph"
(Copyright holder):Copyright holders.
"cre"
(Creator):Package maintainer.
"ths"
(Thesis advisor):Thesis advisor, if the package is part of a thesis.
"trl"
(Translator):Translator to R, if the R code is a translation from another language (typically S).
Person objects can be subscripted by field (using $
) or by position
(using [
), and allow for flexible formatting and printing. By default,
email, role and comment entries are suitably “embraced” by enclosing
with angle brackets, square brackets and parentheses, respectively, and
suitably collapsed. There is also a toBibtex()
method which creates a
suitable BibTeX representation. Finally, the default method for
as.person()
is capable of inverting the default person formatting, and
also can handle formatted person entries collapsed by comma or and
(with appropriate white space). See Figure 1 and
?person
for various illustrative usages of these methods.
R code specifying person objects can also be used in the Authors@R
field of DESCRIPTION
files. This information is used by the R package
management system (specifically, R CMD build
and R CMD INSTALL
) to
create the Author
and Maintainer
fields from Authors@R
, unless
these are still defined (statically) in DESCRIPTION
. Also, the person
information is used for reliably creating the author information in
auto-generated package citations (more on this later). Note that there
are no default roles for Authors@R
: all authors must be given an
author role ("aut"
), and the maintainer must be identified by the
creator role ("cre"
).
Bibentry objects can hold information about an arbitrary positive number
of bibliography entries. These can be obtained by one call to
bibentry()
with list arguments, or by first creating objects
representing single bibentries and combining these via c()
.
Bibentry objects represent bibliographic information in “enhanced BibTeX
style”, using the same entry types (such as ‘Article’ or ‘Book’) and
field names (such as ‘author’ and ‘year’) as BibTeX. A single bibentry
is created by calling bibentry()
with arguments bibtype
(typically
used positionally) and key
(optionally) giving the type of the entry
and a key for it, and further fields given in tag`` = ``value
form,
with tag
and value
the name and value of the field, respectively.
See ?bibentry
for details on types and fields. The author and editor
field values can be person objects as discussed above (and are coerced
to such using as.person()
if they are not). In addition, it is
possible to specify header and footer information for the individual
entries (arguments header
and footer
) and the collection of entries
(arguments mheader
and mfooter
).
Bibentry objects can be subscripted by field (using $
) or by position
(using [
), and have print()
and format()
methods providing a
choice between at least six different styles (as controlled by a style
argument): plain text, HTML, LaTeX, BibTeX, R (for formatting bibentries
as (high-level) R code), and citation information (including
headers/footers and a mixture of plain text and BibTeX). The first three
make use of a .bibstyle
argument using the bibstyle()
function in
package tools which defines the display style of the citation by
providing a collection of functions to format the individual parts of
the entry. Currently the default style "JSS"
, which corresponds to the
bibliography style employed by the Journal of Statistical Software
(http://www.jstatsoft.org/), is the only available style, but it can
be modified by replacing the component functions, or replaced with a
completely new style. A style is implemented as an environment, and is
required to contain functions to format each type of bibliographic entry
(formatArticle()
, formatBook()
, etc.) as well a function
sortKeys()
to sort entries. In addition to these required functions,
the "JSS"
style includes several dozen auxiliary functions that are
employed by the required ones. Users modifying it should read the help
page ?tools::bibstyle
, which includes an example that changes the sort
order. Those implementing a completely new style should consult the
source code.
Figure 2 shows an illustrative example for using : A
vector with two bibentries is created for the boot package and the
book whose methods it implements. In the first bibentry the person
object p
from Figure 1 is reused while the second
bibentry applies to a string with author and role information.
Subsequently, the bibentry is printed in "JSS"
style and transformed
to . Note that the roles of the persons in the author field are not
employed in as does not support that concept. (Hence, the role
information in the call creating b[2]
would not have been necessary
for producing output.)
Currently, the primary use of bibentry objects is in package CITATION
files, which contain R code for defining such objects.1 Function
citation()
collects all of the defined bibentry objects into a single
object for creating package citations, primarily for visual inspection
and export to BibTeX. (Technically, citation()
creates an object of
class "citation"
inheriting from "bibentry"
, for which the print()
method uses a different default style.)
Package citations can also be auto-generated from the package
DESCRIPTION
metadata, see Figure 3 for an example:
The bibentry title field is generated from the package metadata fields
Package
and Title
which should be capitalized (in title case), and
not use markup or end in a period. (Hence, it may need manual touch-ups,
e.g. protecting {S}
or {B}ayesian
etc.) The year is taken from
Date/Publication
(for CRAN packages) or Date
(if available), and the
note field is generated from the package’s Version
. If the package was
installed from CRAN, the official stable CRAN package URL (here:
http://CRAN.R-project.org/package=boot) is given in the url field.2
Finally, the bibentry author field is generated from the description
Author
field (unless there is an Authors@R
, see below). As the
Author
field in DESCRIPTION
is a plain text field intended for human
readers, but not necessarily for automatic processing, it may happen
that the auto-generated is incorrect (as in Figure 3).
Hence, provides an ATTENTION note to this end. One can overcome this
problem by providing an Authors@R
field in DESCRIPTION
, from which
the names of all persons with author roles can reliably be determined
and included in the author field. Note again that author roles ("aut"
)
are not “implied” as defaults, and must be specified explicitly.
As illustrated in Figure 3, one can inspect the
auto-generated package citation bibentry by giving auto = TRUE
when
calling citation()
on a package – even if the package provides a
CITATION
file in which case auto
defaults to FALSE
. The motivation
is that it must be possible to refer to the package itself, even if
the preferred way of citing the package in publications is via the
references given in the package CITATION
file. Of course, both
approaches might also be combined as in the boot package where one of
the bibentries in CITATION
corresponds essentially to the
auto-generated package bibentry (see citation("boot")
). To facilitate
incorporating the auto-generated citations into CITATION
, one can
include a citation(auto = meta)
entry: when evaluating the code in the
CITATION
file, citation()
automagically sets meta
to the package
metadata. This eliminates the need for more elaborate programmatic
constructions of the package citation bibentry (or even worse, manually
duplicating the relevant package metadata), provided that
auto-generation is reliable (i.e. correctly processes author names and
does not need additional LaTeX markup or protection in the title). To
check whether this is the case for a particular package’s metadata and
hence citation(auto = meta)
can conveniently be used in CITATION
,
one can employ constructs such as
citation(auto = packageDescription("boot"))
.
Package maintainers should provide an Authors@R
entry in their
DESCRIPTION
files to allow computations on the list of authors, in
particular to ensure that auto-generation of bibliographic references to
their packages works reliably. Typically, the Author
and Maintainer
entries in DESCRIPTION
can then be omitted in the sources (as these
are automatically added by R CMD build
).
Packages wanting to employ the auto-generated citation as (one of) the
officially recommended reference(s) to their package can use the new
simplified citation(auto = meta)
to do so.3 Note that if this is
used then the package’s citation depends on R at least 2.14.0,
suggesting to declare this requirement in the Depends
field of
DESCRIPTION
.
As of 2012-04-02, CRAN had 3718 packages in total, with about 20% (734)
having CITATION
files, but less than 2.5% (92) already providing an
Authors@R
entry in their package metadata: so clearly, there is
substantial room for future improvement.
R has provided functionality for representing persons and generating
package citations for several years now. Earlier versions used
CITATION
files with calls to citEntry()
(and possibly citHeader()
and citFooter()
). R 2.12.0 introduced the new bibentry functionality
for improved representation and manipulation of bibliographic
information (the old mechanism is still supported and implemented by the
new one). R 2.12.0 also added the new person functionality (changing to
the more appropriate given/family scheme and adding roles). R 2.14.0
added more functionality, including the Authors@R
field in package
DESCRIPTION
files from which Author
and Maintainer
fields in this
file as well as package bibentries can reliably be auto-generated.
Having reliable comprehensive information on package “authors” and their roles which can be processed automatically is also highly relevant for the purpose of understanding the roles of persons in the social fabric underlying the remarkable growth of the R package multiverse, opening up fascinating possibilities for research on R (of course, with R), e.g. using social network analysis and related methods.
Since R 2.10.0, R help files have been able to contain R code in
\Sexpr
macros to be executed before the help text is displayed. In the
future, it is planned to use this facility to auto-generate
\references
sections based on citations in the text, much as LaTeX and
BibTeX can automatically generate bibliographies.
These enhancements will clearly make dealing with bibliographic
information in Rd files much more convenient and efficient, and
eliminate the need to “manually” (re)format entries in the \references
sections. Of course, this should not come at the price of needing to
manually generate bibentries for references already available in other
bibliographic formats. For BibTeX format .bib
databases (presumably
the format typically employed by R users), conversion is
straightforward: one can read the entries in .bib
files into bibentry
objects using read.bib()
in CRAN package
bibtex (François 2011),
use format()
with style = "R"
to obtain a character vector with
bibentry()
calls for each bibentry, and use writeLines()
to write
(or add) this to a bibentry database file (see ?bibentry
for examples
of the last two steps). For other formats, one could first transform to
BibTeX format: e.g. Bibutils (Putnam 2010) can convert between the
bibliography formats COPAC, EndNote refer, EndNote XML, Pubmed XML, ISI
Web of Science, US Library of Congress MODS XML, RIS, and Word 2007.
However, this may lose useful information: e.g. the MODS format can
provide role data which (as discussed above) will be dropped when
converting to BibTeX format. We are thus planning to provide a
bibutils package which instead converts to the MODS XML format first,
and creates bibentry objects from this. However, as this functionality
requires both external software and package
XML (Temple Lang 2012), it cannot
be made available within base R.
With these functionalities in place, managing bibliographic information in R documentation files will become much more convenient, and R will become a much sharper knife for cutting-edge bibliometric and scientometric analyses. In particular, it will become possible to programmatically analyze package citations, and use these to learn about the processes of knowledge dissemination driving the creation of R packages, and the relatedness of package authors and concepts.
Econometrics, Optimization, ReproducibleResearch, Survival, TimeSeries, WebTechnologies
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Hornik, et al., "Who Did What? The Roles of R Package Authors and How to Refer to Them", The R Journal, 2012
BibTeX citation
@article{RJ-2012-009, author = {Hornik, Kurt and Murdoch, Duncan and Zeileis, Achim}, title = {Who Did What? The Roles of R Package Authors and How to Refer to Them}, journal = {The R Journal}, year = {2012}, note = {https://rjournal.github.io/}, volume = {4}, issue = {1}, issn = {2073-4859}, pages = {64-69} }