Since version 2.10.0, R includes expanded support for source references in R code and .Rd
files. This paper describes the origin and purposes of source references, and current and future support for them.
One of the strengths of R is that it allows “computation on the
language”, i.e. the parser returns an R object which can be manipulated,
not just evaluated. This has applications in quality control checks,
debugging, and elsewhere. For example, the pkgcodetools package
(Tierney 2009) examines the structure of parsed source code to look for
common programming errors. Functions marked by debug()
can be executed
one statement at a time, and the trace()
function can insert debugging
statements into any function.
Computing on the language is often enhanced by being able to refer to
the original source code, rather than just to a deparsed (reconstructed)
version of it based on the parsed object. To support this, we added
source references to R 2.5.0 in 2007. These are attributes attached to
the result of parse()
or (as of 2.10.0) parse_Rd()
to indicate where
a particular part of an object originated. In this article I will
describe their structure and how they are used in R. The article is
aimed at developers who want to create debuggers or other tools that use
the source references, at users who are curious about R internals, and
also at users who want to use the existing debugging facilities. The
latter group may wish to skip over the gory details and go directly to
the section “Using Source References".
We start with a quick introduction to the R parser. The parse()
function returns an R object of type "expression"
. This is a list of
statements; the statements can be of various types. For example,
consider the R source shown in Figure 1.
If we parse this file, we obtain an expression of length 3:
> parsed <- parse("sample.R")
> length(parsed)
1] 3 [
> typeof(parsed)
1] "expression" [
The first element is the assignment, the second element is the for
loop, and the third is the single x
at the end:
> parsed[[1]]
<- 1:10 x
> parsed[[2]]
for (i in x) {
print(i)
}
> parsed[[3]]
x
The first two elements are both of type "language"
, and are made up of
smaller components. The difference between an "expression"
and a
"language"
object is mainly internal: the former is based on the
generic vector type (i.e. type "list"
), whereas the latter is based on
the "pairlist"
type. Pairlists are rarely encountered explicitly in
any other context. From a user point of view, they act just like generic
vectors.
The third element x
is of type "symbol"
. There are other possible
types, such as "NULL"
, "double"
, etc.: essentially any simple
R object could be an element.
The comments in the source code and the white space making up the indentation of the third line are not part of the parsed object.
The parse_Rd()
function parses .Rd
documentation files. It also
returns a recursive structure containing objects of different types
(Murdoch and S. Urbanek 2009; Murdoch 2010).
As described above, the result of parse()
is essentially a list (the
"expression"
object) of objects that may be lists (the "language"
objects) themselves, and so on recursively. Each element of this
structure from the top down corresponds to some part of the source file
used to create it: in our example, parse[[1]]
corresponds to the first
line of sample.R
, parse[[2]]
is the second through fourth lines, and
parse[[3]]
is the fifth line.
The comments and indentation, though helpful to the human reader, are
not part of the parsed object. However, by default the parsed object
does contain a "srcref"
attribute:
> attr(parsed, "srcref")
1]]
[[<- 1:10
x
2]]
[[for (i in x) {
print(i) # Print each entry
}
3]]
[[ x
Although it appears that the "srcref"
attribute contains the source,
in fact it only references it, and the print.srcref()
method retrieves
it for printing. If we remove the class from each element, we see the
true structure:
> lapply(attr(parsed, "srcref"), unclass)
1]]
[[1] 1 1 1 9 1 9
[attr(,"srcfile")
sample.R
2]]
[[1] 2 1 4 1 1 1
[attr(,"srcfile")
sample.R
3]]
[[1] 5 1 5 1 1 1
[attr(,"srcfile")
sample.R
Each element is a vector of 6 integers: (first line, first byte, last
line, last byte, first character, last character). The values refer to
the position of the source for each element in the original source file;
the details of the source file are contained in a "srcfile"
attribute
on each reference.
The reason both bytes and characters are recorded in the source reference is historical. When they were introduced, they were mainly used for retrieving source code for display; for this, bytes are needed. Since R 2.9.0, they have also been used to aid in error messages. Since some characters take up more than one byte, users need to be informed about character positions, not byte positions, and the last two entries were added.
The "srcfile"
attribute is also not as simple as it looks. For
example,
> srcref <- attr(parsed, "srcref")[[1]]
> srcfile <- attr(srcref, "srcfile")
> typeof(srcfile)
1] "environment" [
> ls(srcfile)
1] "Enc" "encoding" "filename"
[4] "timestamp" "wd" [
The "srcfile"
attribute is actually an environment containing an
encoding, a filename, a timestamp, and a working directory. These give
information about the file from which the parser was reading. The reason
it is an environment is that environments are reference objects: even
though all three source references contain this attribute, in actuality
there is only one copy stored. This was done to save memory, since there
are often hundreds of source references from each file.
Source references in objects returned by parse_Rd()
use the same
structure as those returned by parse()
. The main difference is that in
Rd objects source references are attached to every component, whereas
parse()
only constructs source references for complete statements, not
for their component parts, and they are attached to the container of the
statements. Thus for example a braced list of statements processed by
parse()
will receive a "srcref"
attribute containing source
references for each statement within, while the statements themselves
will not hold their own source references, and sub-expressions within
each statement will not generate source references at all. In contrast
the "srcref"
attribute for a section in an .Rd
file will be a source
reference for the whole section, and each component part in the section
will have its own source reference.
"source"
attributeBy default the R parser also creates an attribute named "source"
when
it parses a function definition. When available, this attribute is used
by default in lieu of deparsing to display the function definition. It
is unrelated to the "srcref"
attribute, which is intended to point
to the source, rather than to duplicate the source. An integrated
development environment (IDE) would need to know the correspondence
between R code in R and the true source, and "srcref"
attributes are
intended to provide this.
"srcref"
attributes added?As mentioned above, the parser adds a "srcref"
attribute by default.
For this, it is assumes that options("keep.source")
is left at its
default setting of TRUE
, and that parse()
is given a filename as
argument file
, or a character vector as argument text
. In the latter
case, there is no source file to reference, so parse()
copies the
lines of source into a "srcfilecopy"
object, which is simply a
"srcfile"
object that contains a copy of the text.
Developers may wish to add source references in other situations. To do
that, an object inheriting from class "srcfile"
should be passed as
the srcfile
argument to parse()
.
The other situation in which source references are likely to be created
in R code is when calling source()
. The source()
function calls
parse()
, creating the source references, and then evaluates the
resulting code. At this point newly created functions will have source
references attached to the body of the function.
The section “Breakpoints” below discusses how to make sure that source references are created in package code.
For the most part, users need not be concerned with source references, but they interact with them frequently. For example, error messages make use of them to report on the location of syntax errors:
> source("error.R")
in source("error.R") : error.R:4:1: unexpected
Error 'else'
3: print( "less" )
4: else
^
A more recent addition is the use of source references in code being
executed. When R evaluates a function, it evaluates each statement in
turn, keeping track of any associated source references. As of R 2.10.0,
these are reported by the debugging support functions traceback()
,
browser()
, recover()
, and dump.frames()
, and are returned as an
attribute on each element returned by sys.calls()
. For example,
consider the function shown in Figure 2.
This function is syntactically correct, and works to calculate the absolute value of scalar values, but is not a valid way to calculate the absolute values of the elements of a vector, and when called it will generate an incorrect result and a warning:
> source("badabs.R")
> badabs( c(5, -10) )
1] 5 -10 [
:
Warning messageif (x < 0) x <- -x :
In > 1 and only the first
the condition has length element will be used
In this simple example it is easy to see where the problem occurred, but in a more complex function it might not be so simple. To find it, we can convert the warning to an error using
> options(warn=2)
and then re-run the code to generate an error. After generating the error, we can display a stack trace:
> traceback()
5: doWithOneRestart(return(expr), restart)
4: withOneRestart(expr, restarts[[1L]])
3: withRestarts({
.Internal(.signalCondition(
simpleWarning(msg, call), msg, call))
.Internal(.dfltWarn(msg, call))
muffleWarning = function() NULL) at badabs.R#2
}, 2: .signalSimpleWarning("the condition has length
> 1 and only the first element will be used",
quote(if (x < 0) x <- -x)) at badabs.R#3
1: badabs(c(5, -10))
To read a traceback, start at the bottom. We see our call from the
console as line “1:
”, and the warning being signalled in line “2:
”.
At the end of line “2:
” it says that the warning originated
“at badabs.R#3
”, i.e. line 3 of the badabs.R
file.
Users may also make use of source references when setting breakpoints.
The trace()
function lets us set breakpoints in particular
R functions, but we need to specify which function and where to do the
setting. The setBreakpoint()
function is a more friendly front end
that uses source references to construct a call to trace()
. For
example, if we wanted to set a breakpoint on badabs.R
line 3, we could
use
> setBreakpoint("badabs.R#3")
:\svn\papers\srcrefs\badabs.R#3:
D2 in <environment: R_GlobalEnv> badabs step
This tells us that we have set a breakpoint in step 2 of the function
badabs
found in the global environment. When we run it, we will see
> badabs( c(5, -10) )
#3
badabs.R: badabs(c(5, -10)) Called from
1]> Browse[
telling us that we have broken into the browser at the requested line,
and it is waiting for input. We could then examine x
, single step
through the code, or do any other action of which the browser is
capable.
By default, most packages are built without source reference
information, because it adds quite substantially to the size of the
code. However, setting the environment variable R_KEEP_PKG_SOURCE=yes
before installing a source package will tell R to keep the source
references, and then breakpoints may be set in package source code. The
envir
argument to setBreakpoints()
will need to be set in order to
tell it to search outside the global environment when setting
breakpoints.
#line
directiveIn some cases, R source code is written by a program, not by a human
being. For example, Sweave()
extracts lines of code from Sweave
documents before sending the lines to R for parsing and evaluation. To
support such preprocessors, the R 2.10.0 parser recognizes a new
directive of the form
#line nn "filename"
where nn
is an integer. As with the same-named directive in the C
language, this tells the parser to assume that the next line of source
is line nn
from the given filename for the purpose of constructing
source references. The Sweave()
function doesn’t currently make use of
this, but in the future, it (and other preprocessors) could output
#line
directives so that source references and syntax errors refer to
the original source location rather than to an intermediate file.
The #line
directive was a late addition to R 2.10.0. Support for this
in Sweave()
appeared in R 2.12.0.
The source reference structure could be improved. First, it adds quite a
lot of bulk to R objects in memory. Each source reference is an integer
vector of length 6 with a class and "srcfile"
attribute. It is hard to
measure exactly how much space this takes because much is shared with
other source references, but it is on the order of 100 bytes per
reference. Clearly a more efficient design is possible, at the expense
of moving support code to C from R. As part of this move, the use of
environments for the "srcfile"
attribute could be dropped: they were
used as the only available R-level reference objects. For developers,
this means that direct access to particular parts of a source reference
should be localized as much as possible: They should write functions to
extract particular information, and use those functions where needed,
rather than extracting information directly. Then, if the implementation
changes, only those extractor functions will need to be updated.
Finally, source level debugging could be implemented to make use of
source references, to single step through the actual source files,
rather than displaying a line at a time as the browser()
does.
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Murdoch, "Source References", The R Journal, 2010
BibTeX citation
@article{RJ-2010-010, author = {Murdoch, Duncan}, title = {Source References}, journal = {The R Journal}, year = {2010}, note = {https://rjournal.github.io/}, volume = {2}, issue = {2}, issn = {2073-4859}, pages = {16-19} }