This article suggests an implementation of the compendium concept by combining Sweave
and the LaTeX literate programming environment DOCSTRIP
.
(Gentleman and D. T. Lang 2007) introduced compendiums as a mechanism to combine text, data, and auxiliary software into a distributable and executable unit, in order to achieve reproducible research(Buckheit and D. L. Donoho 1995).:
“…research papers with accompanying software tools that allow the reader to directly reproduce the result and employ the methods that are presented …” (Gentleman and D. T. Lang 2007)
(Gentleman 2005) provides an example of how the compendium concept can be
implemented. The core of the implementation is a Sweave
(Leisch 2002).
See also (Leisch 2008) and Meredith and J. S. Racine (2008). source file. This source file is
then packaged together with data and auxiliary software as an R package.
In this article I suggest an alternative implementation of the
compendium concept combining Sweave
with DOCSTRIP
. The latter is the
LaTeX literate programming environment used in package documentation and
as an installation tool.(Mittelbach, D. Duchier, J. Braams, M. Wolińsky, and M. Wooding 2005) and Goossens, F. Mittelbach, and A. Samarin (1994). DOCSTRIP
has
previously been mentioned in the context of reproducible research, but
then mainly as a hard to use alternative to Sweave
.(Hothorn 2006) and
Rising (2008). Here, instead, DOCSTRIP
and Sweave
are combined.
Apart from the possibility to enjoy the functionality of Sweave
and
packages such as xtable
etc the main additional advantages are that
in many applications almost all code and data can be kept in a single source file,
multiple documents (i.e., PDF files) can share the same Sweave code chunks.
This means not only that administration of an empirical project is
facilitated but also that it becomes easier to achieve reproducible
research. Since DOCSTRIP
is a part of every LaTeX installation a
Sweave
user need not install any additional software. Finally,
Sweave
and DOCSTRIP
can be combined to produce more complex projects
such as R packages.
One example of the suggested implemention will be given. It contains R
code common to more than one document; an article containing the
advertisement of the research (using the terminology of Buckheit and D. L. Donoho (1995)),
and one technical documentation of the same research. In the following I
assume that the details of Sweave
are known to the readers of the R
Journal. The rest of the article will (i) give a brief introduction to
DOCSTRIP
, (ii) present and comment the example and (iii) close with
some final remarks.
Suppose we have a source file the entire or partial content of which
should be tangled into one or more result files. In order to determine
which part of the source file that should be tangled into a certain
result file (i) the content of the source file is tagged with none, one
or more tags (tag–lists) and (ii) the various tag–lists are associated
with the result files in a DOCSTRIP
“installation” file.
There are several ways to tag parts of the source file:
A single line: Start the line with ‘%<
tag-list
>
’.
Several lines, for instance one or more code or text chunks in Sweave
terminology: On a single line before the first line of the chunk enter
the start tag ‘%<*
tag-list
>
’ and on a single line after the last
line of the chunk the end tag ‘%</
tag-list
>
’.
All lines: Lines that should be in all result files are left untagged.
tag-list
is a list of tags combined with the Boolean operators ‘|
’
(logical or), ‘&
’ (logical and) and ‘!
’ (logical negation). A
frequent type of list would be, say, ‘tag1|tag2|tag3
’ which will
tangle the tagged material whenever tag1
, tag2
or tag3
is called
for into the result files these tags are associated with. The initial
‘%’ of the tags must be in the file’s first column or else the tag will
not be recognised as a DOCSTRIP
tag. Also, tags must be matched so a
start tag with tag-list
must be closed by an end tag with tag-list
.
This resembles the syntax of LaTeX environments rather than the Sweave
syntax, where the end of a code or text chunk is indicated by the
beginning of the next text or code chunk. Note also that tags cannot be
nested.1
The following source file (docex.txt
) exemplifies all three types of
tags:
<file1|file2>This line begins both files.
%<*file1>
%
in file1
This is the text that should be included
</file1>
%
in both files
This is the text to be included
<*file2>
%in file2
This is the text that should be included </file2>
%<*file1|file2>
%for both files.
Also text </file1|file2> %
For instance, line \(1\) is a single line tagged file1
or file2
,
line \(2\) starts and line \(6\) ends a tag file1
and line \(13\) starts and
line \(15\) ends a tag file1
or file2
. Lines \(7-9\) are untagged.
The next step is to construct a DOCSTRIP
installation file which
associates each tag with one or more result files:
\input docstrip.tex
\keepsilent
\askforoverwritefalse
\nopreamble
\nopostamble
\generate{
\file{file1.txt}{\from{docex.txt}{file1}}
\file{file2.txt}{\from{docex.txt}{file2}}
} \endbatchfile
Line \(1\) loads DOCSTRIP
. Lines \(2-5\) contain options that basically
tell DOCSTRIP
not to issue any messages, to write over any existing
result files and not to mess up the result files with pre– and
post-ambles.2 The action takes place on lines \(6-9\) within the
command ‘\generate{}
’, where lines \(7-8\) associate the tags file1
and file2
in the source file docex.txt
with the result files
file1.txt
and file2.txt
.3
We name this file docex.ins
, where .ins
is the conventional
extension for DOCSTRIP
installation files. DOCSTRIP
is then invoked
with
latex docex.ins
A log–file called docex.log
is created from which we here show the
most important parts (lines \(56-67\)):
file(s) ./file1.txt ./file2.txt
Generating = `./file1.txt'.
\openout0
\openout1 = `./file2.txt'.
Processing file docex.txt (file1) -> file1.txt
(file2) -> file2.txt
Lines processed: 15
Comments removed: 0
Comments passed: 0
Codelines passed: 8
We see that two result files are created from the \(15\) lines of code in
the source file. First file1.txt
;
This line begins both files.
in file1
This is the text that should be included
in both files
This is the text to be included
for both files. Also text
and file2.txt
;
This line begins both files.
in both files
This is the text to be included
in file2
This is the text that should be included for both files. Also text
Note that some lines are blank in both the original source file and the
result files. Disregarding these the two result files together have \(8\)
lines of code. The untagged material in lines \(7-9\) in the source files
is tangled into both result files, the blank lines \(7\) and \(8\) in the
source file result in the blank lines \(5\) and \(7\) in file1.txt
and the
blank lines \(2\) and \(4\) in file2.txt
.
In the following a simple example will be given of how DOCSTRIP
can be
combined with Sweave
to implement the compendium concept. The starting
point is a “research problem” which involves loading some data into R,
preprocessing the data, conducting an estimation and presenting the
result. The purpose is to construct a single compendium source file
which contains the code used to create (i) an “article” PDF–file which
will provide a brief account of the test and (ii) a “technical
documentation” PDF–file which gives a more detailed description of
loading and preprocessing data and the estimation. The source file also
contains the code of a BibTeX databse file and the DOCSTRIP
installation file. Although this source file is neither a LaTeX file or
a Sweave
file I will use the extension .rnw
since it first run
through Sweave
. Here we simplify the example by using data from an R
package, but if the data set is not too large it could be a part of the
source file.
We can think of the “article” as the “advertisement” intended for
journal publication and the “technical documentation” as a more complete
account of the actual research intended to be available on (say) a web
place. However, tables, graphs and individual statistics should
originate from the same R code so whenever Sweave
is executed these
are updated in both documents. There may also be common text chunks and
when they are changed in the source file, both documents are updated via
the result files.
The example code in the file example_source.rnw
is as follows:
<*article|techdoc>
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Author: Michael Lundholm
%% Email: michael.lundholm@ne.su.se
%% Date: 2010-09-06
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% The project files consists of
%% * example_source.rnw (THIS FILE)
%% To create other files execute
%% R CMD Sweave example_source.rnw
%% latex example.ins
%% pdflatex example_techdoc.tex
%% bibtex example_techdoc
%% pdflatex example_article.tex
%% bibtex example_article
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentclass{article}
\usepackage{Sweave,amsmath,natbib}</article|techdoc>
%<article>\title{Article}
%<techdoc>\title{Technical documentation}
%<*article|techdoc>
%
\author{M. Lundholm}
\begin{document}
\maketitle56ff]{Kleiber08}
This note replicates the \citet[p.
estimation of a price per citation elasticity offor economics journals equal
library subscriptions $\input{coef.txt}$.
to </article|techdoc>
%<*techdoc>
%in R package AER on CRAN, is loaded:
The data, available <<Loading data>>=
data("Journals",package="AER")
@
</techdoc>
%<*article|techdoc>
%
The data set includes the number of librarysubscriptions ($S_i$), the number of citations
$C_i$) and the subscription price for libraries
($P_i$). We want to estimate the model
($$\log(S_i)=\alpha_0+\alpha_1
left(P_i/C_i\right)+\epsilon_i,$$
\log\$P_i/C_i$ is the price per citation.
where </article|techdoc>
%<*techdoc>
%
We the define the price per citation, includein the data frame \texttt{Journals}
the variable <<Define variable>>=
$citeprice <- Journals$price/Journals$citations
Journals@
:
and estimate the model<<Estimate>>=
<- lm(log(subs)~log(citeprice),data=Journals)
result @
</techdoc>
%<*article|techdoc>
%in
The result with OLS standard errors is ~\ref{ta:est}.
Table<<Result,results=tex,echo=FALSE>>=
library(xtable)
xtable(summary(result),label="ta:est",
caption="Estimation results")
@
<<echo=FALSE>>=
write(round(coef(result)[[2]],2),file="coef.txt")
@
\bibliographystyle{abbrvnat}
\bibliography{example}
\end{document}</article|techdoc>
%<*bib>
%@Book{ Kleiber08,
= {Christian Kleiber and Achim Zeileis},
author = {Springer},
publisher = {2008},
year = {Applied Econometrics with {R}}}
title </bib>
%<*dump>
%<<Write DOCSTRIP installation file>>=
writeLines(
"\\input docstrip.tex
\\keepsilent
\\askforoverwritefalse
\\nopreamble
\\nopostamble
\\generate{
\\file{example_article.tex}%
{\\from{example_source.tex}{article}}
\\file{example_techdoc.tex}%
{\\from{example_source.tex}{techdoc}}
\\file{example.bib}%
{\\from{example_source.tex}{bib}}}
\\endbatchfile
",con="example.ins")
@
</dump> %
The compendium source file contains the following DOCSTRIP
tags (for
their association to files, see below):
article
associated with example_article.tex
, which contains the
code to the “advertisement” article,
techdoc
associated with example_techdoc.tex
, which contains the
code to the technical documentation,
bib
associated with example.bib
which contains the code to the
BibTeX data base file,
dump
associated with no file.
Note that the tags article
and techdoc
overlap with eachother but
not with bib
and dump
, which in turn are mutually exclusive. There
is no untagged material.
Lines \(2-15\) contain general information about the distributed project,
which could be more or less elaborate. Here it just states that the
project is distributed as a single source file and how the compendium
source file should be processed to get the relevant output
example_article.pdf
and example_techdoc.pdf
.
When the instructions are followed, Sweave
is run first on
example_source.rnw
creating the file example_source.tex
, in which
the Sweave
code chunks are replaced by the corresponding R output code
wrapped with LaTeX typesetting commands. One of the R functions used in
this Sweave session is writeLines()
(see the lines \(80-96\)) so that
the DOCSTRIP
installation file example.ins
is created before
DOCSTRIP
is run.
This file example_source.tex
is the DOCSTRIP
source file from which
the DOCSTRIP
utility, together with the installation file
example.ins
, creates the result files example_article.tex
,
example_techdoc.tex
and example.bib
. The two first result files
share some but not all code from the DOCSTRIP
source file. The result
files are then run with the LaTeX family of software (here pdflatex
and BibTeX) to create two PDF–files example_article.pdf
and
example_techdoc.pdf
. These are shown in
Figures 1–2.
Note that the entire bibliography (BibTeX) file is included on lines
\(73-77\) and extracted with DOCSTRIP
. Note also on line \(73\) that
unless the @ indicating a new bibliographical entry is not in column
\(1\) it is mixed up by Sweave
as a new text chunk and will be removed,
with errors as the result when BibTeX is run.4
The bibliography database file is common to both example_article.tex
and example_techdoc.tex
. Here the documents have the same single
reference. But in real implementations bibliographies would probably not
overlap completely. This way handling references is then preferable
since all bibliographical references occur only once in the source
file.5
In LaTeX cross references are handled by writing information to the
auxiliary file, which is read by later LaTeX runs. This handles
references to an object located both before and after the reference in
the LaTeX file. In Sweave
can be used to refer to R objects created
before but not after the reference is made. This is not exemplified
here. But since Sweave
and LaTeX are run sequentially an object can be
created by R, written to a file (see the code chunk on lines \(65-67\))
and then be used in the LaTeX run with the command \input{}
(see code
line \(29\)).
By making use of combinations of DOCSTRIP
and (say) writeLines()
and
by changing the order in which Sweave
and DOCSTRIP
are executed the
applications can be made more complex. Such examples may be found
Lundholm (2010a,c).6 Also, the use of DOCSTRIP
can
facilitate the creation of R packages as exemplified by the R data
package sifds available on
CRAN (Lundholm 2010b). Another type of example would be teaching material,
where this article may itself serve as an example. Apart from the
DOCSTRIP
installation file and a Bash script file all code used to
produce this article is contained in a single source file. The Bash
script, together with DOCSTRIP
, creates all example files including
the final PDF–files; that is, all example code is executed every time
this article is updated. So, if the examples are changed an update of
the article via the Bash script also updates the final PDF–files in
Figures 1–2.7
This article was written on a i486-pc-linux-gnu platform using R version
2.11.1 (2010-05-31), LaTeX\(2_\varepsilon\) (2005/12/01) and DOCSTRIP
2.5d (2005/07/29).
The compendium implementation presented here is partially developed in projects joint with Mahmood Arai, to whome I am owe several constructive comments on a previous version.
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Lundholm, "Implementing the Compendium Concept with Sweave and DOCSTRIP", The R Journal, 2011
BibTeX citation
@article{RJ-2011-013, author = {Lundholm, Michael}, title = {Implementing the Compendium Concept with Sweave and DOCSTRIP}, journal = {The R Journal}, year = {2011}, note = {https://rjournal.github.io/}, volume = {3}, issue = {2}, issn = {2073-4859}, pages = {16-21} }