Implementing the Compendium Concept with Sweave and DOCSTRIP

Abstract:

This article suggests an implementation of the compendium concept by combining Sweave and the LaTeX literate programming environment DOCSTRIP.

Published: Dec. 1, 2011

Citation: Lundholm, 2011

Volume: 3/2

Pages: 16 - 21


1 Introduction

Gentleman and Lang (2007) introduced compendiums as a mechanism to combine text, data, and auxiliary software into a distributable and executable unit, in order to achieve reproducible research:

“…research papers with accompanying software tools that allow the reader to directly reproduce the result and employ the methods that are presented …”

Gentleman (2005) provides an example of how the compendium concept can be implemented. The core of the implementation is a Sweave source file (see Leisch, 2002, 2008, and Meredith and Racine, 2008). This source file is then packaged together with data and auxiliary software as an R package.

In this article I suggest an alternative implementation of the compendium concept combining Sweave with DOCSTRIP. The latter is the LaTeX literate programming environment used in package documentation and as an installation tool (Mittelbach et al., 2005; Goossens et al., 1994). DOCSTRIP has previously been mentioned in the context of reproducible research, but then mainly as a hard-to-use alternative to Sweave (Hothorn, 2006; Rising, 2008). Here, instead, DOCSTRIP and Sweave are combined.

Apart from the possibility to enjoy the functionality of Sweave and of packages such as xtable, the main additional advantages are that

in many applications almost all code and data can be kept in a single source file,

multiple documents (i.e., PDF files) can share the same Sweave code chunks.

This means not only that the administration of an empirical project is facilitated but also that it becomes easier to achieve reproducible research. Since DOCSTRIP is part of every LaTeX installation, a Sweave user need not install any additional software. Finally, Sweave and DOCSTRIP can be combined to produce more complex projects such as R packages.

One example of the suggested implementation will be given. It contains R code common to more than one document: an article containing the advertisement of the research (using the terminology of Buckheit and Donoho, 1995), and a technical documentation of the same research. In the following I assume that the details of Sweave are known to the readers of the R Journal. The rest of the article will (i) give a brief introduction to DOCSTRIP, (ii) present and comment on the example and (iii) close with some final remarks.

2 DOCSTRIP

Suppose we have a source file, the entire or partial content of which should be tangled into one or more result files. In order to determine which parts of the source file should be tangled into a certain result file, (i) the content of the source file is tagged with none, one or more tags (tag-lists) and (ii) the various tag-lists are associated with the result files in a DOCSTRIP “installation” file.

There are several ways to tag parts of the source file:

A single line: Start the line with ‘%<tag-list>’.

Several lines, for instance one or more code or text chunks in Sweave terminology: On a single line before the first line of the chunk enter the start tag ‘%<*tag-list>’ and on a single line after the last line of the chunk the end tag ‘%</tag-list>’.

All lines: Lines that should be in all result files are left untagged.

tag-list is a list of tags combined with the Boolean operators ‘|’ (logical or), ‘&’ (logical and) and ‘!’ (logical negation). A frequent type of list would be, say, ‘tag1|tag2|tag3’, which tangles the tagged material into the result files associated with tag1, tag2 or tag3 whenever one of these tags is called for. The initial ‘%’ of a tag must be in the file’s first column, or else the tag will not be recognised as a DOCSTRIP tag. Also, tags must be matched, so a start tag with tag-list must be closed by an end tag with the same tag-list. This resembles the syntax of LaTeX environments rather than the Sweave syntax, where the end of a code or text chunk is indicated by the beginning of the next text or code chunk. Note also that tags cannot be nested. (More exactly: outer tags, which are described here, cannot be nested, but inner tags can be nested with outer tags. See the DOCSTRIP documentation for details.)
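The Boolean reading of a tag-list can be made concrete with a small shell sketch. It is not part of DOCSTRIP: the helper name taglist_selects is hypothetical, and the sketch assumes that ‘&’ binds more tightly than ‘|’, that ‘!’ is applied to single tags only, and that no parentheses are used.

```shell
#!/bin/sh
# taglist_selects TAG-LIST [ACTIVE-TAG...]
# Hypothetical helper (not a DOCSTRIP command): succeeds when TAG-LIST is
# true for the given set of active tags. Assumptions: no parentheses,
# negation on single tags only, and "&" binding more tightly than "|".
taglist_selects() {
  taglist=$1
  shift
  awk -v list="$taglist" -v active="$*" 'BEGIN {
    n = split(active, names, " ")
    for (i = 1; i <= n; i++) on[names[i]] = 1
    result = 0
    m = split(list, alternatives, "[|]")        # one true alternative suffices
    for (i = 1; i <= m; i++) {
      all = 1
      k = split(alternatives[i], factors, "[&]") # every factor must be true
      for (j = 1; j <= k; j++) {
        t = factors[j]
        negate = 0
        while (substr(t, 1, 1) == "!") { negate = !negate; t = substr(t, 2) }
        value = (t in on) ? 1 : 0
        if (negate) value = !value
        if (!value) all = 0
      }
      if (all) result = 1
    }
    exit(result ? 0 : 1)
  }'
}

taglist_selects 'tag1|tag2|tag3' tag2 && echo "tag1|tag2|tag3 selects tag2"
taglist_selects 'file1&!draft' file1 draft || echo "file1&!draft rejects file1 with draft"
```

The exit status follows the usual shell convention (0 for "selected"), so the helper can be used directly in if statements.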

The following source file (docex.txt) exemplifies all three types of tags:

%<file1|file2>This line begins both files.
%<*file1>

This is the text that should be included in file1

%</file1>

This is the text to be included in both files

%<*file2>
This is the text that should be included in file2
%</file2>
%<*file1|file2>
Also text for both files.
%</file1|file2>

For instance, line 1 is a single line tagged file1 or file2, line 2 starts and line 6 ends a tag file1, and line 13 starts and line 15 ends a tag file1 or file2. Lines 7–9 are untagged.

The next step is to construct a DOCSTRIP installation file which associates each tag with one or more result files:

\input docstrip.tex
\keepsilent
\askforoverwritefalse
\nopreamble
\nopostamble
\generate{
\file{file1.txt}{\from{docex.txt}{file1}}
\file{file2.txt}{\from{docex.txt}{file2}}
}
\endbatchfile

Line 1 loads DOCSTRIP. Lines 2–5 contain options that basically tell DOCSTRIP not to issue any messages, to write over any existing result files and not to clutter the result files with pre- and postambles. (Pre- and postambles are text lines that start with a comment character. Since result files may be processed by software using different comment characters, some care is needed when using pre- and postambles constructed by DOCSTRIP. The DOCSTRIP documentation describes how to set up pre- and postambles that are common to all result files from a given installation file.) The action takes place on lines 6–9 within the command ‘\generate{}’, where lines 7–8 associate the tags file1 and file2 in the source file docex.txt with the result files file1.txt and file2.txt. (From the example one can infer that multiple source files are possible, although the compendium implementation discussed later would in most cases have only one.)
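The remark that several source files are possible can be sketched as follows. The file names part1.txt, part2.txt, combined.txt and multi.ins are hypothetical; the sketch relies on DOCSTRIP accepting more than one \from{} entry inside a single \file{} command and tangling them in the order given. DOCSTRIP is only run if a latex executable is found.

```shell
#!/bin/sh
# Sketch: one result file assembled from two source files.
# All file names below are hypothetical.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' '%<*all>' 'Text from the first source file.' '%</all>' > part1.txt
printf '%s\n' '%<*all>' 'Text from the second source file.' '%</all>' > part2.txt

# Quoted here-document: backslashes reach multi.ins unchanged.
cat > multi.ins <<'EOF'
\input docstrip.tex
\keepsilent
\askforoverwritefalse
\nopreamble
\nopostamble
\generate{
\file{combined.txt}{\from{part1.txt}{all}
                    \from{part2.txt}{all}}
}
\endbatchfile
EOF

if command -v latex >/dev/null 2>&1; then
  # Run DOCSTRIP and show the combined result file.
  latex multi.ins </dev/null >/dev/null 2>&1 && cat combined.txt
else
  echo "latex not found; multi.ins written to $workdir"
fi
```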

We name this file docex.ins, where .ins is the conventional extension for DOCSTRIP installation files. DOCSTRIP is then invoked with

latex docex.ins

A log file called docex.log is created, from which we here show the most important parts (lines 56–67):

Generating file(s) ./file1.txt ./file2.txt 
\openout0 = `./file1.txt'.

\openout1 = `./file2.txt'.


Processing file docex.txt (file1) -> file1.txt
                          (file2) -> file2.txt
Lines  processed: 15
Comments removed: 0
Comments  passed: 0
Codelines passed: 8 

We see that two result files are created from the 15 lines of code in the source file. First file1.txt;

This line begins both files.

This is the text that should be included in file1


This is the text to be included in both files

Also text for both files.

and file2.txt;

This line begins both files.

This is the text to be included in both files

This is the text that should be included in file2
Also text for both files.

Note that some lines are blank in both the original source file and the result files. Disregarding these, the two result files together have 8 lines of code. The untagged material in lines 7–9 of the source file is tangled into both result files: the blank lines 7 and 9 in the source file result in the blank lines 5 and 7 in file1.txt and the blank lines 2 and 4 in file2.txt.
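For readers who want to check this line accounting without a TeX installation, the tangling of docex.txt can be imitated with a few lines of awk. This is only a sketch of the behaviour described in this section, not DOCSTRIP itself: the helper name tangle is hypothetical, and only tag-lists built with ‘|’ are handled.

```shell
#!/bin/sh
# Sketch only: imitates how DOCSTRIP treats the three kinds of tags for
# tag-lists that use nothing but "|". tangle is a hypothetical helper.
tangle() {  # usage: tangle TAG < source-file > result-file
  awk -v tag="$1" '
    function selected(list,    parts, n, i) {
      n = split(list, parts, "[|]")
      for (i = 1; i <= n; i++) if (parts[i] == tag) return 1
      return 0
    }
    /^%<\*[^>]*>/ {                       # start tag: open a tagged block
      inblock = 1
      keep = selected(substr($0, 4, index($0, ">") - 4))
      next
    }
    /^%<\/[^>]*>/ { inblock = 0; next }   # end tag: close the block
    /^%<[^>]*>/ {                         # single tagged line
      if (selected(substr($0, 3, index($0, ">") - 3)))
        print substr($0, index($0, ">") + 1)
      next
    }
    !inblock || keep                      # untagged lines and selected blocks
  '
}

workdir=$(mktemp -d)
cat > "$workdir/docex.txt" <<'EOF'
%<file1|file2>This line begins both files.
%<*file1>

This is the text that should be included in file1

%</file1>

This is the text to be included in both files

%<*file2>
This is the text that should be included in file2
%</file2>
%<*file1|file2>
Also text for both files.
%</file1|file2>
EOF

for tag in file1 file2; do
  tangle "$tag" < "$workdir/docex.txt" > "$workdir/$tag.txt"
  echo "$tag.txt: $(grep -c '' "$workdir/$tag.txt") lines"
done
```

Applied to docex.txt, the sketch gives an 8-line file1.txt and a 6-line file2.txt with the blank lines in the positions noted above.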

3 Example

In the following a simple example will be given of how DOCSTRIP can be combined with Sweave to implement the compendium concept. The starting point is a “research problem” which involves loading some data into R, preprocessing the data, conducting an estimation and presenting the result. The purpose is to construct a single compendium source file which contains the code used to create (i) an “article” PDF file which will provide a brief account of the test and (ii) a “technical documentation” PDF file which gives a more detailed description of loading and preprocessing data and the estimation. The source file also contains the code of a BibTeX database file and the DOCSTRIP installation file. Although this source file is neither a LaTeX file nor a Sweave file, I will use the extension .rnw since it is first run through Sweave. Here we simplify the example by using data from an R package, but if the data set is not too large it could be a part of the source file.

We can think of the “article” as the “advertisement” intended for journal publication and the “technical documentation” as a more complete account of the actual research, intended to be available on (say) a web site. However, tables, graphs and individual statistics should originate from the same R code, so whenever Sweave is executed these are updated in both documents. There may also be common text chunks, and when they are changed in the source file, both documents are updated via the result files.

The example code in the file example_source.rnw is as follows:

%<*article|techdoc>
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Author: Michael Lundholm
%% Email:  michael.lundholm@ne.su.se
%% Date:   2010-09-06
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% The project files consist of
%%  * example_source.rnw (THIS FILE)
%% To create other files execute
%%   R CMD Sweave example_source.rnw
%%   latex example.ins
%%   pdflatex example_techdoc.tex
%%   bibtex example_techdoc
%%   pdflatex example_article.tex
%%   bibtex example_article
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentclass{article}
\usepackage{Sweave,amsmath,natbib}
%</article|techdoc>
%<article>\title{Article}
%<techdoc>\title{Technical documentation}
%<*article|techdoc>
\author{M. Lundholm}
\begin{document}
\maketitle
This note replicates the \citet[p. 56ff]{Kleiber08}
estimation of a price per citation elasticity of
library subscriptions for economics journals equal
to $\input{coef.txt}$.
%</article|techdoc>
%<*techdoc>
The data, available in R package AER on CRAN, is loaded:
<<Loading data>>=
data("Journals",package="AER")
@
%</techdoc>
%<*article|techdoc>
The data set includes the number of library
subscriptions ($S_i$), the number of citations
($C_i$)  and the subscription price for libraries
($P_i$). We want to estimate the model
$$\log(S_i)=\alpha_0+\alpha_1
\log\left(P_i/C_i\right)+\epsilon_i,$$
where $P_i/C_i$ is the price per citation.
%</article|techdoc>
%<*techdoc>
We then define the price per citation, include
the variable in the data frame \texttt{Journals}
<<Define variable>>=
Journals$citeprice <- Journals$price/Journals$citations
@
and estimate the model:
<<Estimate>>=
result <- lm(log(subs)~log(citeprice),data=Journals)
@
%</techdoc>
%<*article|techdoc>
The result with OLS standard errors is in
Table~\ref{ta:est}.
<<Result,results=tex,echo=FALSE>>=
library(xtable)
xtable(summary(result),label="ta:est",
caption="Estimation results")
@
<<echo=FALSE>>=
write(round(coef(result)[[2]],2),file="coef.txt")
@
\bibliographystyle{abbrvnat}
\bibliography{example}
\end{document}
%</article|techdoc>
%<*bib>
 @Book{ Kleiber08,
author = {Christian Kleiber and Achim Zeileis},
publisher = {Springer},
year = {2008},
title = {Applied Econometrics with {R}}}
%</bib>
%<*dump>
<<Write DOCSTRIP installation file>>=
writeLines(
"\\input docstrip.tex
\\keepsilent
\\askforoverwritefalse
\\nopreamble
\\nopostamble
\\generate{
\\file{example_article.tex}%
{\\from{example_source.tex}{article}}
\\file{example_techdoc.tex}%
{\\from{example_source.tex}{techdoc}}
\\file{example.bib}%
{\\from{example_source.tex}{bib}}}
\\endbatchfile
",con="example.ins")
@
%</dump>

The compendium source file contains the DOCSTRIP tags article, techdoc, bib and dump (for their association to files, see below).

Note that the tags article and techdoc overlap with each other but not with bib and dump, which in turn are mutually exclusive. There is no untagged material.

Figure 1: example_article.pdf.

Lines 2–15 contain general information about the distributed project, which could be more or less elaborate. Here it just states that the project is distributed as a single source file and how the compendium source file should be processed to get the relevant output example_article.pdf and example_techdoc.pdf.

When the instructions are followed, Sweave is run first on example_source.rnw, creating the file example_source.tex, in which the Sweave code chunks are replaced by the corresponding R output code wrapped with LaTeX typesetting commands. One of the R functions used in this Sweave session is writeLines() (see lines 80–96), so that the DOCSTRIP installation file example.ins is created before DOCSTRIP is run.

The file example_source.tex is the DOCSTRIP source file from which the DOCSTRIP utility, together with the installation file example.ins, creates the result files example_article.tex, example_techdoc.tex and example.bib. The first two result files share some but not all code from the DOCSTRIP source file. The result files are then run with the LaTeX family of software (here pdflatex and BibTeX) to create the two PDF files example_article.pdf and example_techdoc.pdf. These are shown in Figures 1 and 2.
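The processing order described here can be collected in a small shell script. This is a sketch: the helper names run and build_compendium and the DRY_RUN switch are hypothetical, and the two extra pdflatex passes after BibTeX are an addition to the recipe in the source file header, following common LaTeX practice for resolving citations and cross references.

```shell
#!/bin/sh
# Sketch of a build script for the example compendium.
# With DRY_RUN=1 the commands are printed instead of executed.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@" || exit 1
  fi
}

build_compendium() {
  # Sweave writes example_source.tex, example.ins and coef.txt.
  run R CMD Sweave example_source.rnw
  # DOCSTRIP tangles the two .tex result files and example.bib.
  run latex example.ins
  for doc in example_techdoc example_article; do
    run pdflatex "$doc.tex"
    run bibtex "$doc"
    # Extra passes (not in the header recipe) so that citations resolve.
    run pdflatex "$doc.tex"
    run pdflatex "$doc.tex"
  done
}

# Print the planned commands; set DRY_RUN=0 to execute them.
DRY_RUN=1
build_compendium
```

Keeping the order in one place matters because example.ins and coef.txt only exist after the Sweave step, so DOCSTRIP and pdflatex cannot be run first.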

Figure 2: example_techdoc.pdf.

Note that the entire bibliography (BibTeX) file is included on lines 73–77 and extracted with DOCSTRIP. Note also on line 73 that the @ indicating a new bibliographical entry is kept out of column 1; otherwise Sweave would mistake it for the start of a new text chunk and remove it, with errors as the result when BibTeX is run. (The tag dump is a safeguard against this material being allocated to some result file by DOCSTRIP; in this case to the BibTeX data base file.)
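Since a BibTeX entry starting in column 1 is easy to introduce by accident, a simple check can be run on the source file before Sweave. The following is a sketch; check_bib_at and the two small test files are hypothetical, and the check only looks at lines between the bib start and end tags.

```shell
#!/bin/sh
# Sketch: report BibTeX entries inside the bib block whose @ is in column 1.
# check_bib_at is a hypothetical helper; it only reads the file.
check_bib_at() {  # usage: check_bib_at SOURCE-FILE
  awk '
    /^%<\*bib>/ { inbib = 1; next }
    /^%<\/bib>/ { inbib = 0; next }
    inbib && /^@/ { printf "%s:%d: @ in column 1\n", FILENAME, FNR; bad = 1 }
    END { exit(bad ? 1 : 0) }
  ' "$1"
}

workdir=$(mktemp -d)
# good.rnw keeps the @ out of column 1, bad.rnw does not.
printf '%s\n' '%<*bib>' ' @Book{ Kleiber08,' '%</bib>' > "$workdir/good.rnw"
printf '%s\n' '%<*bib>' '@Book{ Kleiber08,' '%</bib>' > "$workdir/bad.rnw"

check_bib_at "$workdir/good.rnw" && echo "good.rnw: ok"
check_bib_at "$workdir/bad.rnw" || echo "bad.rnw: needs fixing"
```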

The bibliography database file is common to both example_article.tex and example_techdoc.tex. Here the documents have the same single reference, but in real implementations bibliographies would probably not overlap completely. This way of handling references is then preferable, since all bibliographical references occur only once in the source file. (One alternative would be to replace the command \bibliography{example} on line 69 with the content of example_article.bbl and example_techdoc.bbl, appropriately tagged for DOCSTRIP. However, this procedure would require an “external” bibliography data base file. The problem then is that each time the data base is changed, manual updating of the parts of example_source.rnw that create example_article.bbl and example_techdoc.bbl is required. Creating the bibliography data base file via DOCSTRIP makes this manual updating unnecessary.)

In LaTeX, cross references are handled by writing information to the auxiliary file, which is read by later LaTeX runs. This handles references to an object located both before and after the reference in the LaTeX file. In Sweave, \Sexpr{} can be used to refer to R objects created before but not after the reference is made. This is not exemplified here. But since Sweave and LaTeX are run sequentially, an object can be created by R, written to a file (see the code chunk on lines 65–67) and then be used in the LaTeX run with the command \input{} (see code line 29).

4 Final comments

By making use of combinations of DOCSTRIP and (say) writeLines(), and by changing the order in which Sweave and DOCSTRIP are executed, the applications can be made more complex. Such examples may be found in Lundholm (2010a, 2010c). (An early attempt to implement the ideas presented in this article can be found in Arai et al. (2009).) Also, the use of DOCSTRIP can facilitate the creation of R packages, as exemplified by the R data package sifds available on CRAN (Lundholm, 2010b). Another type of example would be teaching material, where this article may itself serve as an example. Apart from the DOCSTRIP installation file and a Bash script file, all code used to produce this article is contained in a single source file. The Bash script, together with DOCSTRIP, creates all example files including the final PDF files; that is, all example code is executed every time this article is updated. So, if the examples are changed, an update of the article via the Bash script also updates the final PDF files in Figures 1 and 2. (The compendium source files of projects mentioned in this paragraph, including this article, can be found at http://people.su.se/~lundh/projects/.)

5 Colophon

This article was written on an i486-pc-linux-gnu platform using R version 2.11.1 (2010-05-31), LaTeX2ε (2005/12/01) and DOCSTRIP 2.5d (2005/07/29).

6 Acknowledgement

The compendium implementation presented here was partially developed in projects joint with Mahmood Arai, to whom I owe several constructive comments on a previous version.


CRAN packages used

xtable, sifds

CRAN Task Views implied by cited packages

ReproducibleResearch

References

M. Arai, J. Karlsson, and M. Lundholm. On fragile grounds: A replication of Are Muslim immigrants different in terms of cultural integration? Accepted for publication in the Journal of the European Economic Association, 2009. URL http://www.eeassoc.org/index.php?site=JEEA&page=55.
J. B. Buckheit and D. L. Donoho. WaveLab and reproducible research. In A. Antoniadis and G. Oppenheim, editors, Wavelets and Statistics, Lecture Notes in Statistics 103. Springer Verlag, 1995.
R. Gentleman and D. T. Lang. Statistical analyses and reproducible research. Journal of Computational and Graphical Statistics, 16, 2007. URL http://pubs.amstat.org/doi/pdfplus/10.1198/106186007X178663.
R. Gentleman. Reproducible research: A bioinformatics case study. 2005. URL http://www.bioconductor.org/docs/papers/2003/Compendium/Golub.pdf.
M. Goossens, F. Mittelbach, and A. Samarin. The LaTeX Companion. Addison-Wesley, Reading, MA, USA, second edition, 1994.
T. Hothorn. Praktische aspekte der reproduzierbarkeit statistischer analysen in klinischen studien. Institut für Medizininformatik, Biometrie und Epidemiologie, Friedrich-Alexander-Universität Erlangen-Nürnberg, 2006. URL http://www.imbe.med.uni-erlangen.de/~hothorn/talks/AV.pdf.
F. Leisch. Sweave User Manual. R version 2.7.1, 2008. URL http://www.stat.uni-muenchen.de/~leisch/Sweave/Sweave-manual.pdf.
F. Leisch. Sweave: Dynamic generation of statistical reports using literate data analysis. In W. Härdle and B. Rönz, editors, Compstat – Proceedings in Computational Statistics, 2002.
M. Lundholm. Are inflation forecasts from major swedish forecasters biased? Research paper in economics :10, Department of Economics, Stockholm University, 2010a. URL http://swopec.hhs.se/sunrpe/abs/sunrpe2010_0010.htm.
M. Lundholm. sifds: Swedish inflation forecast data set. R package version 0.9, 2010b. URL http://www.cran.r-project.org/web/packages/sifds/index.html.
M. Lundholm. Sveriges riksbank’s inflation interval forecasts –2005. Research paper in economics :11, Department of Economics, Stockholm University, 2010c. URL http://swopec.hhs.se/sunrpe/abs/sunrpe2010_0010.htm.
E. Meredith and J. S. Racine. Towards reproducible econometric research: The Sweave framework. Journal of Applied Econometrics, Published Online: 12 Nov, 2008. URL http://dx.doi.org/10.1002/jae.1030.
F. Mittelbach, D. Duchier, J. Braams, M. Wolińsky, and M. Wooding. The DOCSTRIP program. Version 2.5d, 2005. URL http://tug.ctan.org/tex-archive/macros/latex/base/.
B. Rising. Reproducible research: Weaving with Stata. StataCorp LP, 2008. URL http://www.stata.com/meeting/italy08/rising_2008.pdf.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Lundholm, "Implementing the Compendium Concept with Sweave and DOCSTRIP", The R Journal, 2011

BibTeX citation

@article{RJ-2011-013,
  author = {Lundholm, Michael},
  title = {Implementing the Compendium Concept with Sweave and DOCSTRIP},
  journal = {The R Journal},
  year = {2011},
  note = {https://rjournal.github.io/},
  volume = {3},
  issue = {2},
  issn = {2073-4859},
  pages = {16-21}
}