Reproducible research and data archiving are increasingly important issues in research involving statistical analyses of quantitative data. This article introduces the dvn package, which allows R users to publicly archive datasets, analysis files, codebooks, and associated metadata in Dataverse Network online repositories, an open-source data archiving project sponsored by Harvard University. In this article I review the importance of data archiving in the context of reproducible research, introduce the Dataverse Network, explain the implementation of the dvn package, and provide example code for archiving and releasing data using the package.
Reproducible research involves the careful, annotated preservation of data, analysis code, and associated files, such that statistical procedures, output, and published results can be directly and fully replicated (see, for example, King 1995; Boyer 2003). R provides an increasingly large set of tools for engaging in reproducible research practices. Perhaps most prominent is knitr (Xie 2013), which expands upon Sweave (Leisch 2002) to help R users produce fully reproducible research reports for a variety of markup languages. Gandrud (2013b) and others have advocated for the widespread use of reproducible research tools, but a major shortcoming of the existing set of tools is the lack of a means of easily archiving reproducible research files for use by others. While it is easy to privately archive files for reproducing one’s own work later on, making study files publicly available is not something readily supported by R or other existing tools.
A reproducible research workflow is most valuable when it allows others to reproduce published results precisely. Thus complete records of all data and analyses need to be publicly and persistently available via a stable, freely available archive. Gandrud (2013a,b) has advocated for the use of Dropbox and GitHub for fulfilling this archiving role. Both sites are supported, to varying extents, by existing R packages: rDrop (Ram 2012) for Dropbox and rgithub (Scheidegger 2013) for GitHub. However, Dropbox is not designed for persistent archiving, depends on assumptions about the future availability of a proprietary, closed-source web service, and offers no platform for finding archived data (thus introducing a further dependence on the publication of a Dropbox link somewhere else on the web). GitHub integrates with the git version-control system, which many R users may already use as part of their existing workflow. It is helpful for supporting an ongoing (and possibly collaborative) reproducible research workflow but is not designed to be a persistent data archive.
Thus researchers remain in need of a service dedicated to the persistent
preservation of data with sophisticated metadata support to allow others
to easily find, understand, and use archived files. One such option is
The Dataverse Network, a free-to-use, open-source data archive sponsored
and developed by the Institute for Quantitative Social Science at
Harvard University (Altman et al. 2001; King 2007).
The Dataverse Network is a server application that hosts collections of
studies, called dataverses.
Each study archived in a dataverse contains, at a minimum, an identifying title, but may additionally include an array of metadata and files. This means The Dataverse Network can be used to store not just a free-standing data file but also associated files in almost any format, such as data, codebooks, analysis replication files, statistical packages, questionnaires, and experimental materials. Data uploaded to a dataverse is converted to a generic tabular format and can subsequently be downloaded in one of several common formats (e.g., Stata, S-Plus, R, and tab-delimited).
Once created, a study is given a global identifier in the form of either
a Handle or a DOI.
Archiving data and files using The Dataverse Network also means that
those files can be made available to others. The Dataverse Network
settings offer fine-grained control over access to archived studies,
meaning a study collection can be regulated anywhere from completely
publicly accessible, to accessible only to registered users, to
accessible only to whitelisted users. Public data then becomes easily
searchable by others based on any of the metadata associated with the
study files, a major advantage over archiving on any other website.
The dvn package provides a lightweight wrapper for two Application
Programming Interfaces (APIs) provided by The Dataverse Network. The
first, the Data Sharing API, is essentially a search utility for
locating existing studies within a Dataverse Network instance using
simple HTTP GET requests. The second, the Data Deposit API, is a bit
more complicated. It is built on the SWORD (Simple Web-service Offering
Repository Deposit) protocol.
Thus dvn adds to an increasingly large number of packages that connect
R to web-based services.
dvn will remain under active development for the foreseeable future and will track any modifications made to The Dataverse Network APIs. Bug reports, code pull requests, and other suggestions are welcome on the dvn homepage. The next two sections describe functionality of the package in its current form.
The search functionality of dvn can be used to easily locate archived data in any public Dataverse Network. dvn defaults to providing access to The Harvard Dataverse Network, but this can be changed in each function call or globally using options:
# use Odum Institute Dataverse
options(dvn = '')
# use Harvard IQSS Dataverse
options(dvn = '')
for any valid Dataverse Network. dvSearchField provides a list of
available metadata search fields. These can be used to search for
studies by various authors or on various topics using dvSearch (which
accepts boolean logic):
dvSearch(list(authorName = "leeper"))
dvSearch(list(title = "Denmark", title = "Sweden"), boolean = "OR")
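To make the boolean logic concrete, the following is a minimal sketch of how a named list of search terms might be collapsed into a single query string before being sent in an HTTP GET request. The helper build_query and the "field:value" syntax are assumptions made for illustration; they are not taken from the dvn source, and the default operator here is likewise arbitrary.

```r
# Hypothetical helper: compose a fielded query string from a named list.
# The "field:value" syntax is an assumption for illustration only.
build_query <- function(terms, boolean = c("AND", "OR")) {
  boolean <- match.arg(boolean)
  stopifnot(is.list(terms), !is.null(names(terms)), all(nzchar(names(terms))))
  # Pair each field name with its value, e.g. "title:Denmark"
  parts <- paste0(names(terms), ":", unlist(terms, use.names = FALSE))
  # Join the pairs with the chosen boolean operator
  paste(parts, collapse = paste0(" ", boolean, " "))
}

q <- build_query(list(title = "Denmark", title = "Sweden"), boolean = "OR")
print(q)

# Percent-encode the query before placing it in a GET request URL
print(utils::URLencode(q, reserved = TRUE))
```

Repeating a field name, as in the two title entries, is what lets an OR search match either value within the same metadata field.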
Calling dvSearch with a single character argument searches all metadata
fields simultaneously using a boolean OR logic. The result is
(presently) a one-column dataframe listing objectId values, which
represent the global identifier for each study.
> dvSearch('puppies')
2 search results returned
1 hdl:1902.1/21521
2 hdl:1902.1/21522
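The objectId values carry their identifier scheme as a prefix (hdl for a Handle, doi for a DOI). A short base R sketch, using the hypothetical helper id_scheme, shows one way to separate the two when processing a batch of search results:

```r
# Identifiers as they appear in this article's example output
ids <- c("hdl:1902.1/21521", "hdl:1902.1/21522", "doi:10.7910/DVN/ILAJO")

# Hypothetical helper: keep only the text before the first colon
id_scheme <- function(object_id) sub(":.*$", "", object_id)

split_ids <- data.frame(
  objectId = ids,
  scheme = id_scheme(ids),
  stringsAsFactors = FALSE
)
print(split_ids)
```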
An objectId can then be used to retrieve detailed metadata available
for a study. Metadata is available in two formats: Dublin Core (DC) and
Data Documentation Initiative (DDI).
dvMetadata("hdl:1902.1/21964") # return DDI (by default)
dvMetadata("hdl:1902.1/21964", format.type="oai_dc") # return DC
Currently dvMetadata returns metadata as a character string with S3
class "dvMetadata" because both DC and DDI output can be quite
extensive and are better viewed in a text editor than in the R console.
For similar reasons, dvn does not fully parse the DDI or DC XML.
Instead, the wrapper function dvExtractFileIds extracts relevant
information from the DDI formatted metadata.
files <- dvExtractFileIds(dvMetadata("hdl:1902.1/21964"))
> files[, c('fileName', 'fileId')]
fileName fileId
1 study2-replication-analysis.r 2341713
2 study1-replication-analysis.r 2341709
3 coefpaste.r 2341888
4 expResults.r 2341889
5 Study 2 Codebook.docx 2341712
6 study2-data-final-2012-06-08.csv 2341711
7 study1-data-final-2012-06-08.csv 2341710
8 Study 2 2341714
9 Study 1 2341715
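Because the file listing is an ordinary data frame, it can be filtered with standard tools. The sketch below rebuilds a few rows of the listing by hand (so that it runs without network access) and selects file IDs by extension. Treating fileId as character is an assumption based on how IDs are printed elsewhere in this article.

```r
# A hand-built subset of the listing above, standing in for the real
# return value of dvExtractFileIds()
files <- data.frame(
  fileName = c("study2-replication-analysis.r", "Study 2 Codebook.docx",
               "study2-data-final-2012-06-08.csv",
               "study1-data-final-2012-06-08.csv"),
  fileId = c("2341713", "2341712", "2341711", "2341710"),
  stringsAsFactors = FALSE
)

# Classify files by (lower-cased) extension
ext <- tolower(tools::file_ext(files$fileName))
data_ids <- files$fileId[ext == "csv"]  # data files
code_ids <- files$fileId[ext == "r"]    # analysis scripts

print(data_ids)
print(code_ids)
```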
With a fileId returned by dvExtractFileIds, one can see the formats
available for a file (e.g., data can be downloaded in a number of
formats), using dvDownloadInfo:
> dvDownloadInfo(files$fileId[1])
File Name: study2-replication-analysis.r
File ID: 2341713
File Type: text/plain; charset=US-ASCII
File Size: 15699
Authentication: anonymous
Direct Access? false
Terms of Use apply.
Access Services:
serviceName serviceArgs contentType
1 termsofuse TermsOfUse=true text/plain
2 bundleTOU package=WithTermsOfUse application/zip
1 Terms of Use/Access Restrictions associated with the data file
2 Data File and the Terms of Use in a Zip archive
If allowed by Terms of Use, dvDownload can be used to download a file
directly into memory. Otherwise the output of dvDownload will advise
the user to download the data through a browser using the URI returned
by dvExtractFileIds:
> dvDownload(files$fileId[1])
Error in dvDownload(files$fileId[1]) :
Terms of Use apply.
Data cannot be accessed directly...
try browsing URI from dvExtractFileIds(dvMetadata())
> files$URI[1]
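In a script, an error like the one above would halt execution, so it can be useful to catch it and fall back to the browser URI. The following is a minimal sketch of that pattern using tryCatch. The function mock_download is a stub that always fails (it does not contact any server), and fetch_or_uri and the placeholder URI are hypothetical names introduced here.

```r
# Stand-in for a download function that fails when Terms of Use apply
# (hypothetical stub; it does not contact any server)
mock_download <- function(file_id) stop("Terms of Use apply.")

# Try a direct download; on error, report it and return the URI instead
fetch_or_uri <- function(file_id, uri, download = mock_download) {
  tryCatch(
    list(ok = TRUE, content = download(file_id), uri = NA_character_),
    error = function(e) {
      message("Direct download failed: ", conditionMessage(e))
      list(ok = FALSE, content = NULL, uri = uri)
    }
  )
}

res <- fetch_or_uri("2341713", uri = "https://example.org/placeholder-uri")
print(res$ok)
```

In real use, the download argument would be replaced by the package's own download function and the uri argument by the corresponding value from the file listing.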
While searching for extant data files is a helpful feature to have within R, the dvn wrappers for the Data Deposit API offer much more useful tools for archiving a reproducible research project. The basic workflow involves functions to:
create a study using appropriate metadata (dvCreateStudy)
add one or more files to that study (dvAddFile)
release the study to the public (dvReleaseStudy)
Use of the Data Deposit functions requires authentication, which is currently provided by HTTP basic authentication, relying on the username and password associated with a Dataverse Network user account. The username and password can be included atomically in each function call or stored globally (to conserve typing and allow the sharing of code without credentials embedded in each function call):
options(dvn.user = "username")
options(dvn.pwd = "password")
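To keep credentials out of shared scripts entirely, one option is to read them from environment variables and only then store them in the dvn.user and dvn.pwd options. This is a sketch under assumptions: the variable names DVN_USER and DVN_PWD and the helper set_dvn_credentials are hypothetical and are not defined by the package.

```r
# Hypothetical helper: copy credentials from environment variables into
# the options that dvn reads. Returns TRUE (invisibly) on success.
set_dvn_credentials <- function(user_var = "DVN_USER", pwd_var = "DVN_PWD") {
  user <- Sys.getenv(user_var, unset = "")
  pwd  <- Sys.getenv(pwd_var, unset = "")
  if (!nzchar(user) || !nzchar(pwd)) {
    message("Credentials not found; set ", user_var, " and ", pwd_var)
    return(invisible(FALSE))
  }
  options(dvn.user = user, dvn.pwd = pwd)
  invisible(TRUE)
}

set_dvn_credentials()
```

The environment variables could be defined, for example, in a user-level .Renviron file that is never committed alongside the analysis code.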
As a simple check of authentication, one can call dvServiceDoc() with
no arguments, which (if authentication is successful) returns a brief
statement confirming the Dataverse implementation and user account in
use along with a list of available dataverse collections, or otherwise
reports an HTTP 403 (Forbidden) code.
To create a study, we simply need to call dvCreateStudy with some
associated metadata in Dublin Core format. We can either build a
metadata file manually or use dvBuildMetadata to build that XML for
us. The only required field is “title.” The built metadata can then be
supplied to dvCreateStudy, additionally specifying a named dataverse
to which the user has access:
m <- dvBuildMetadata(title = "My Study")
s <- dvCreateStudy("mydataverse", m)
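For readers who prefer to build the metadata file manually, the sketch below shows roughly what Dublin Core terms wrapped in an Atom entry look like, which is the general shape used by SWORD-style deposits. The exact XML produced by dvBuildMetadata is not shown in this article, so the layout here is an assumption; build_dc_entry and xml_escape are hypothetical helpers written in base R.

```r
# Escape the characters that are unsafe inside XML text nodes
xml_escape <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  gsub(">", "&gt;", x, fixed = TRUE)
}

# Hypothetical helper: wrap named Dublin Core terms in an Atom entry
build_dc_entry <- function(...) {
  fields <- list(...)
  stopifnot("title" %in% names(fields))  # title is the only required field
  elements <- sprintf("  <dcterms:%s>%s</dcterms:%s>",
                      names(fields),
                      xml_escape(unlist(fields, use.names = FALSE)),
                      names(fields))
  paste(c('<?xml version="1.0"?>',
          paste0('<entry xmlns="http://www.w3.org/2005/Atom" ',
                 'xmlns:dcterms="http://purl.org/dc/terms/">'),
          elements,
          "</entry>"),
        collapse = "\n")
}

entry <- build_dc_entry(title = "My Study", creator = "Me", date = "2013")
cat(entry, "\n")
```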
If successful, the response is a simple summary of the created study.
Included in the response is the new study’s objectId, which uniquely
identifies the study. We can use either the objectId or the return
value of dvCreateStudy (an object of class "dvStudyAtom") to add
files and release the study.
We can add files in a number of ways and in a number of formats. For
example, we can add one or more files simply by specifying their
filenames in dvAddFile. The filename is used to list the file in the
study. We can also send multiple files in a compressed zip directory,
which The Dataverse Network application will automatically unpack. The
category argument additionally allows you to group files into
categories, such as “data” and “code.”
# add files and release study using `objectid`
dvAddFile(s, "")
# or add multiple files:
dvAddFile(s, c("file1.csv", "file2.txt"))
# or add R dataframes as files:
mydf <- data.frame(x = 1:10, y = 11:20)
dvAddFile(s, dataframe = "mydf")
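When archiving a dataframe, it can be reassuring to write it to an explicit file first, so that the exact contents being deposited can be inspected and re-read before upload. How dvAddFile serializes a dataframe internally is not described here, so the following is an independent sketch; stage_dataframe is a hypothetical helper.

```r
# Hypothetical helper: write a dataframe to a CSV file in a staging
# directory and return the path, ready to be passed on as a filename
stage_dataframe <- function(df, name, dir = tempdir()) {
  stopifnot(is.data.frame(df))
  path <- file.path(dir, paste0(name, ".csv"))
  write.csv(df, path, row.names = FALSE)
  path
}

mydf <- data.frame(x = 1:10, y = 11:20)
path <- stage_dataframe(mydf, "mydf")

# Verify that the staged file exists and reads back with the same shape
stopifnot(file.exists(path), file.size(path) > 0)
roundtrip <- read.csv(path)
print(dim(roundtrip))
```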
Adding files is non-destructive, so we can add files all at once in a
single .zip directory or we can add them sequentially. The Dataverse
Network does not overwrite an existing file. Attempting to do so will
result in an error and dvAddFile will return NULL. If a file fails
to upload successfully, dvAddFile will return an error, and if a data
file uploads successfully but then fails to decompress or is in an
unrecognizable format, you will receive an email to the account
affiliated with your dataverse informing you of the failure.
With metadata and files added, we can release the study using
dvReleaseStudy. Before doing so, it may be reasonable to check the
study using dvStudyStatement to see a short summary of study contents:
> dvStudyStatement(s)
Study author: Unknown
Study title: My Study
ObjectId: doi:10.7910/DVN/ILAJO
Study URI:
Last updated: 2013-12-02T09:56:07.298Z
Status: DRAFT
Locked? false
type updated fileId
1 text/tab-separated-values 2013-12-02T09:56:16.227Z 2349641
The above example output shows a variety of information, including the
lack of a study author, the state of the study as DRAFT (meaning the
current version is unreleased), and a list of the included files (we see
the mydf object uploaded above). If satisfied, we can release the
study with dvReleaseStudy.
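The kind of manual inspection described here can also be expressed as a small pre-release check. The sketch below operates on a mock list rather than a real study statement: the files$fileId element mirrors what this article shows, while the author and locked element names are assumptions, as is the helper ready_to_release.

```r
# Hypothetical pre-release check: returns a character vector of problems,
# or an empty vector if nothing looks wrong
ready_to_release <- function(statement) {
  problems <- character(0)
  if (is.null(statement$author) || statement$author == "Unknown")
    problems <- c(problems, "no study author")
  if (length(statement$files$fileId) == 0)
    problems <- c(problems, "no files attached")
  if (isTRUE(statement$locked))
    problems <- c(problems, "study is locked")
  problems
}

# Mock statement resembling the output above (not a real dvn object)
mock <- list(author = "Unknown", status = "DRAFT", locked = FALSE,
             files = data.frame(fileId = "2349641", stringsAsFactors = FALSE))
print(ready_to_release(mock))
```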
When a study is released, it becomes immediately accessible via its
persistent Handle or DOI. If changes are made to the dataverse (either
metadata or files), these remain in draft form until the dataverse is
released again. Each release is persistently available unless the entire
study collection is deleted.
Before releasing a study, we can delete it from The Dataverse Network
server using dvDeleteStudy. Once a study is released, it cannot be
deleted, but access to it can be revoked. Thus, calling dvDeleteStudy
only deletes studies that are unreleased. Calling dvDeleteStudy on a
released study revokes public access, but a persistent page will remain
at the study URL noting that a study was previously released. Similarly,
it is possible to delete files using dvDeleteFile with an argument
specifying the fileId (which we can extract from dvStudyStatement):
> d <- dvStudyStatement(s)
> d$files$fileId
[1] "2349642" "2349641"
> dvDeleteFile("2349642")
Operation appears to have succeeded.
[1] ""
If a file deletion is successful, dvn will report a success message
and dvDeleteFile will return an empty character string.
We can also modify a study’s metadata using dvEditStudy, supplied with
a metadata listing (e.g., as returned by dvBuildMetadata):
> m2 <- dvBuildMetadata(title = 'An Actual Title', creator = 'Me', date = '2013')
> dvEditStudy(s, m2)
Citation: Me, 2013, "An Actual Title", V1
ObjectId: doi:10.7910/DVN/ILAJO
Study URI:
Generated by: 2.0
The response reflects the updated metadata. If changes are made to a
study (files are added or deleted, or metadata is modified), those
changes will be visible via dvn as a DRAFT study version, but will
not be public. To make them public, simply release the study again with
dvReleaseStudy. Both study versions will remain accessible via The
Dataverse Network web browser interface.
Once we’ve created one or more studies (regardless of whether they are
released), we can view them using dvUserStudies for a named dataverse
or from a dvServiceDoc object.
Thus while we only need three core functions (dvCreateStudy, dvAddFile,
and dvReleaseStudy) to complete a reproducible research workflow, the
other functions supplied by dvn allow a great degree of control over
the appearance and contents of a study.
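The three-step workflow can be collected into a single convenience function. This is a sketch rather than part of the package: archive_study is a hypothetical wrapper, it uses only the calls demonstrated earlier in this article, and it assumes that the dvn package is installed and that credentials have already been set via options. Defining the function does not contact any server.

```r
# Hypothetical wrapper around the create / add / release steps.
# Nothing is sent to a server until the function is actually called.
archive_study <- function(dataverse, title, files) {
  if (!requireNamespace("dvn", quietly = TRUE)) {
    stop("the dvn package is required")
  }
  metadata <- dvn::dvBuildMetadata(title = title)   # Dublin Core metadata
  study <- dvn::dvCreateStudy(dataverse, metadata)  # create a draft study
  dvn::dvAddFile(study, files)                      # attach one or more files
  dvn::dvReleaseStudy(study)                        # make the study public
  invisible(study)
}

# Example call (not run here, since it would create a public study):
# archive_study("mydataverse", "My Study", c("file1.csv", "file2.txt"))
```

Because release is effectively irreversible, a wrapper like this is probably best reserved for studies whose contents have already been checked in draft form.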
As Gandrud (2013b) has demonstrated, R is a powerful tool in the reproducible research workflow. Yet a missing element of that workflow to date has been the ability to easily store research files in a persistent, public data repository designed for reproducible research. The Dataverse Network is a free, open-source service explicitly dedicated to permanently storing data, with implementations hosted by a variety of institutions (mostly research universities), which gives researchers a choice over whom to trust with the persistent storage of study records. By automatically providing study records a unique URI, data stored using The Dataverse Network has a version-controlled and persistent identifier that can be cited by subsequent users of the data, continuing the reproducible workflow process beyond the phase of data production and initial analysis. With the popularity of The Dataverse Network increasing, it is a logical choice for archiving one’s reproducible data and analysis files. dvn thus provides R users with the tools necessary to quickly and easily integrate data archiving into the reproducible research workflow.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0.
For attribution, please cite this work as
Leeper, "Archiving Reproducible Research with R and Dataverse", The R Journal, 2014
BibTeX citation
@article{RJ-2014-015, author = {Leeper, Thomas J.}, title = {Archiving Reproducible Research with R and Dataverse}, journal = {The R Journal}, year = {2014}, note = {}, volume = {6}, issue = {1}, issn = {2073-4859}, pages = {151-158} }