Reproducible research and data archiving are increasingly important issues in research involving statistical analyses of quantitative data. This article introduces the dvn package, which allows R users to publicly archive datasets, analysis files, codebooks, and associated metadata in repositories built on The Dataverse Network, an open-source data archiving project sponsored by Harvard University. In this article I review the importance of data archiving in the context of reproducible research, introduce The Dataverse Network, explain the implementation of the dvn package, and provide example code for archiving and releasing data using the package.
Reproducible research involves the careful, annotated preservation of data, analysis code, and associated files, such that statistical procedures, output, and published results can be directly and fully replicated (see, for example, King 1995; Boyer 2003). R provides an increasingly large set of tools for engaging in reproducible research practices. Perhaps most prominent is knitr (Xie 2013), which expands upon Sweave (Leisch 2002) to help R users produce fully reproducible research reports for a variety of markup languages. Gandrud (2013b) and others have advocated for the widespread use of reproducible research tools, but a major shortcoming of the existing toolset is the lack of a means of easily archiving reproducible research files for use by others. While it is easy to privately archive files for reproducing one's own work later on, making study files publicly available is not something readily supported by R or other existing tools.
A reproducible research workflow is most valuable when it allows others to replicate results and reproduce published results precisely. Thus complete records of all data and analyses need to be publicly and persistently available via a stable, freely available archive. Gandrud (2013a,b) has advocated for the use of Dropbox and GitHub for fulfilling this archiving role. Both sites are supported, to varying extents, by existing R packages: rDrop (Ram 2012) for Dropbox and rgithub (Scheidegger 2013) for GitHub. However, Dropbox is not designed for persistent archiving, depends on assumptions about the future availability of a proprietary, closed-source web service, and offers no platform for finding archived data (thus introducing a further dependence on the publication of a Dropbox link somewhere else on the web). GitHub, which integrates with the git version-control system (which many R users may already invoke as part of their existing workflow), is helpful for supporting an ongoing (and possibly collaborative) reproducible research workflow but is not designed to be a persistent data archive.
Thus researchers remain in need of a service dedicated to the persistent preservation of data with sophisticated metadata support to allow others to easily find, understand, and use archived files. One such option is The Dataverse Network, a free-to-use, open-source data archive sponsored and developed by the Institute for Quantitative Social Science at Harvard University (Altman et al. 2001; King 2007). The next section introduces The Dataverse Network, and the remainder of the article describes how to use the archive directly from R via the dvn package.
The Dataverse Network is a server application that hosts collections of studies, called dataverses. As an open-source application, The Dataverse Network has many instances (i.e., hosted by different institutions), and each instance, e.g., the Harvard IQSS Dataverse Network, hosts many dataverses controlled by individuals, institutions, and academic journals. Though there are few Dataverse Network instances (about 10 worldwide), each of these (especially those hosted by Harvard University and the Odum Institute at the University of North Carolina at Chapel Hill) hosts thousands of individual dataverses containing, collectively, hundreds of thousands of study listings. Consequently, The Dataverse Network is an increasingly prominent repository of data and associated files and is the default archive for many journals that currently require public archiving of replication files as a condition of publication. To use The Dataverse Network, one needs to create an account. To release studies, one also needs to create a personal dataverse collection associated with that account. Accounts are tied to specific Dataverse Network instances, so for simplicity's sake this article focuses on the Harvard Dataverse Network.
Each study archived in a dataverse contains, at a minimum, an identifying title, but may additionally include an array of metadata and files. This means The Dataverse Network can be used to store not just a free-standing data file, but associated files in almost any format: data, codebooks, analysis replication files, statistical packages, questionnaires, experimental materials, and so forth. Data uploaded to a dataverse are converted to a generic tabular format and can subsequently be downloaded in one of several common formats (e.g., Stata, S-Plus, R, and tab-delimited).
Once created, a study is given a global identifier in the form of either a Handle or a DataCite DOI, digital object identifiers that uniquely and globally identify the study. Studies are version controlled when new files are added or when metadata is modified, and previous versions remain persistently accessible. Furthermore, when data files are modified, the study identifier is updated with additional version information in the form of a Universal Numeric Fingerprint (UNF) (Altman and King 2007; Altman 2008). The UNF is a format-independent hash of a data file, which provides a unique identifier for a dataset without revealing the content of that dataset, such that any change to the underlying data yields a new UNF. This means that data stored in a dataverse can be cited (e.g., in scholarly publications) not only by their handle but also by their specific version using the UNF. Reproducible research efforts can thus rely not only on the same named dataset but on the exact same version of that dataset, a major advantage over storing data in other types of archives. Via a UNF, data users can check a dataset they possess against one cited in a research publication or stored in a dataverse.
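As an aside, UNFs can also be computed locally in R. The sketch below assumes the separate CRAN package UNF and its unf() function (neither is part of dvn, and this usage is an illustration rather than something the article describes):

```r
# Sketch: computing a UNF locally to check a dataset against a citation.
# Assumes the CRAN package "UNF" and its unf() function; not part of dvn.
library(UNF)
mydata <- data.frame(x = 1:10, y = 11:20)
unf(mydata)        # prints a fingerprint of the form "UNF6:..."
# Any change to the underlying data yields a different fingerprint:
mydata$x[1] <- 99
unf(mydata)
```

Comparing the locally computed fingerprint against the UNF in a data citation confirms whether the two datasets are byte-for-byte semantically identical.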
Archiving data and files using The Dataverse Network also means that those files can be made available to others. The Dataverse Network settings offer fine-grained control over access to archived studies: a study collection can be anywhere from completely publicly accessible, to accessible only to registered users, to accessible only to whitelisted users. Public data then become easily searchable by others based on any of the metadata associated with the study files, a major advantage over archiving on other websites. The supported metadata elements cover almost all imaginable aspects of a data collection, including those provided by the Dublin Core standard, such as creator, subject, description, date, type, rights, citation, etc. Additional metadata elements are also supported, and Dataverse additionally supplies metadata in the Data Documentation Initiative (DDI) format, a comprehensive metadata schema for quantitative data, especially in the social sciences.
The dvn package provides a lightweight wrapper for two Application Programming Interfaces (APIs) provided by The Dataverse Network. The first, the Data Sharing API, is essentially a search utility for locating existing studies within a Dataverse Network instance using simple HTTP GET requests. The second, the Data Deposit API, is a bit more complicated. It is built on the SWORD (Simple Web-service Offering Repository Deposit) protocol, an open standard for the deposit of digital records. This API relies on HTTP requests to GET, PUT, POST, and DELETE resources on a Dataverse server. Access to both APIs is implemented through RCurl (Temple Lang 2012a) and responses are parsed with XML (Temple Lang 2012b). From the user's perspective, dvn masks the distinction between the two APIs, offering functions both for conducting data searches and for archiving data, code, and associated metadata.
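Under the hood, a Data Sharing API call is simply an HTTP GET whose XML response is then parsed. A minimal sketch of that pattern is below; the endpoint path is an assumption based on the Data Sharing API of the time, and dvn users never need to issue such requests themselves:

```r
# Sketch of the kind of request dvn issues to the Data Sharing API.
# The endpoint path is an assumption; dvn handles this internally.
library(RCurl)
library(XML)
response <- getURL("https://thedata.harvard.edu/dvn/api/metadataSearchFields")
fields <- xmlParse(response)   # parse the XML listing of search fields
```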
Thus dvn adds to an increasingly large number of packages that connect R to web-based services, such as those built by the rOpenSci project. Perhaps the closest existing R package to dvn is rfigshare (Boettiger et al. 2013), which allows R users to persistently store files on http://figshare.com/. rdryad (Chamberlain et al. 2013) and OAIHarvester (Hornik 2013), which can be used to harvest metadata records from http://datadryad.org/ or, in the latter case, from any public repository that supports the OAI-PMH standard, share commonalities with the Data Sharing API, but neither allows users to upload their own files.
dvn will remain under active development for the foreseeable future and will track any modifications made to The Dataverse Network APIs. Bug reports, code pull requests, and other suggestions are welcome on the dvn homepage at https://github.com/rOpenSci/dvn. The next two sections describe functionality of the package in its current form.
The search functionality of dvn can be used to easily locate archived data in any public Dataverse Network. dvn defaults to providing access to The Harvard Dataverse Network (located at https://thedata.harvard.edu/dvn/), but this can be changed in each function call or globally using options:
# use Odum Institute Dataverse
options(dvn = 'https://arc.irss.unc.edu/dvn/')
# use Harvard IQSS Dataverse
options(dvn = 'https://thedata.harvard.edu/dvn/')
for any valid Dataverse Network. With a Dataverse Network selected, we can search that network for studies using any metadata field. dvSearchField provides a list of the available metadata search fields.
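For example, the fields can be listed directly (the exact shape of the return value is an assumption here; consult the package documentation):

```r
# List the metadata fields that dvSearch will accept
dvSearchField()
```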
These fields can be used to search for studies by various authors or on various topics using dvSearch (which accepts boolean logic):
dvSearch(list(authorName = "leeper"))
dvSearch(list(title = "Denmark", title = "Sweden"), boolean = "OR")
dvSearch("Puppies")
Calling dvSearch with a single character argument searches all metadata fields simultaneously using boolean OR logic. The result is (presently) a one-column dataframe listing objectId values, which represent the global identifier for each study. Here is an example using our final search from above:
> dvSearch("puppies")
2 search results returned
          objectId
1 hdl:1902.1/21521
2 hdl:1902.1/21522
An objectId can then be used to retrieve the detailed metadata available for a study. Metadata is available in two formats: Dublin Core (DC) and Data Documentation Initiative (DDI). DC is an almost universally adopted, minimalist metadata schema widely used in information science. DDI is a more sophisticated format, designed initially for social science data applications, that can include considerably more metadata information. By default, Dataverse and dvn return the more comprehensive DDI format. To request DC, one needs to specify it using the format.type argument:
dvMetadata("hdl:1902.1/21964") # return DDI (by default)
dvMetadata("hdl:1902.1/21964", format.type="oai_dc") # return DC
Currently dvMetadata returns metadata as a character string with S3 class "dvMetadata", because both DC and DDI output can be quite extensive and are better viewed in a text editor than in the R console. For similar reasons, dvn does not fully parse the DDI or DC XML. Instead, the wrapper function dvExtractFileIds extracts relevant information from the DDI-formatted metadata. Here's an example:
> files <- dvExtractFileIds(dvMetadata("hdl:1902.1/21964"))
> files[, c('fileName', 'fileId')]
                          fileName  fileId
1    study2-replication-analysis.r 2341713
2    study1-replication-analysis.r 2341709
3                      coefpaste.r 2341888
4                     expResults.r 2341889
5            Study 2 Codebook.docx 2341712
6 study2-data-final-2012-06-08.csv 2341711
7 study1-data-final-2012-06-08.csv 2341710
8             Study 2 Webpages.zip 2341714
9             Study 1 Webpages.zip 2341715
With a fileId returned by dvExtractFileIds, one can see the formats available for a file (e.g., data can be downloaded in a number of formats) using dvDownloadInfo:
> dvDownloadInfo(files$fileId[1])
File Name:      study2-replication-analysis.r
File ID:        2341713
File Type:      text/plain; charset=US-ASCII
File Size:      15699
Authentication: anonymous
Direct Access?  false
Terms of Use apply.
Access Services:
  serviceName             serviceArgs      contentType
1  termsofuse         TermsOfUse=true       text/plain
2   bundleTOU  package=WithTermsOfUse  application/zip
                                                     serviceDesc
1 Terms of Use/Access Restrictions associated with the data file
2                Data File and the Terms of Use in a Zip archive
If allowed by the Terms of Use, dvDownload can be used to download a file directly into memory. Otherwise, the output of dvDownload will advise the user to download the data through a browser using the URI returned by dvExtractFileIds:
> dvDownload(files$fileId[1])
Error in dvDownload(files$fileId[1]) :
  Terms of Use apply. Data cannot be accessed directly...
  try browsing URI from dvExtractFileIds(dvMetadata())
> files$URI[1]
[1] http://thedata.harvard.edu/dvn/dv/leeper/FileDownload/study2-replication-analysis.r?fileId=2341713
While searching for extant data files is a helpful feature to have within R, the dvn wrappers for the Data Deposit API offer much more useful tools for archiving a reproducible research project. The basic workflow involves functions to:

1. create a study using appropriate metadata (dvCreateStudy),
2. add one or more files to that study (dvAddFile), and
3. release the study to the public (dvReleaseStudy).
Use of the Data Deposit functions requires authentication, which is currently provided by HTTP basic authentication, relying on the username and password associated with a Dataverse Network user account. The username and password can be included atomically in each function call or stored globally (to conserve typing and allow the sharing of code without credentials embedded in each function call):
options(dvn.user = "username")
options(dvn.pwd = "password")
As a simple check of authentication, one can call dvServiceDoc() with no arguments, which (if authentication is successful) returns a brief statement confirming the Dataverse implementation and user account in use, along with a list of available dataverse collections; otherwise it reports an HTTP 403 (Forbidden) code.
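A sketch of this check, using the placeholder credentials from above:

```r
# Verify credentials before attempting any deposit operations
options(dvn.user = "username")
options(dvn.pwd = "password")
dvServiceDoc()   # on success, lists available dataverse collections
```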
To create a study, we simply need to call dvCreateStudy with some associated metadata in Dublin Core format. We can either build a metadata file manually or use dvBuildMetadata to build the XML for us. The only required field is "title". The built metadata can then be supplied to dvCreateStudy, additionally specifying a named dataverse to which the user has access:
m <- dvBuildMetadata(title = "My Study")
s <- dvCreateStudy("mydataverse", m)
If successful, the response is a simple summary of the created study. Included in the response is the new study's objectId, which uniquely identifies the study. We can use either the objectId or the return value of dvCreateStudy (an object of class "dvStudyAtom") to add files and release the study.
We can add files in a number of ways and in a number of formats. For example, we can add one or more files simply by specifying their filenames in dvAddFile. The filename is used to list the file in the study. We can also send multiple files in a compressed zip directory, which The Dataverse Network application will automatically unpack. Another option is to send a named dataframe directly from memory:
# add files to the study using its `objectId`:
dvAddFile(s, "mydata.zip")
# or add multiple files:
dvAddFile(s, c("file1.csv", "file2.txt"))
# or add R dataframes as files:
mydf <- data.frame(x = 1:10, y = 11:20)
dvAddFile(s, dataframe = "mydf")
Adding files is non-destructive, so we can add files all at once in a single .zip directory or we can add them sequentially. The Dataverse Network does not overwrite an existing file; attempting to do so will result in an error and dvAddFile will return NULL. If a file fails to upload successfully, dvAddFile will return an error, and if a data file uploads successfully but then fails to decompress or is in an unrecognizable format, you will receive an email to the account affiliated with your dataverse informing you of the failure.
With metadata and files added, we can release the study using dvReleaseStudy. Before doing so, it may be reasonable to check the study using dvStudyStatement to see a short summary of study contents:
> dvStudyStatement(s)
Study author: Unknown
Study title:  My Study
ObjectId:     doi:10.7910/DVN/ILAJO
Study URI:    https://thedata.harvard.edu/dvn/api/data-deposit/v1/swordv2/edit/study/doi:10.7910/DVN/ILAJO
Last updated: 2013-12-02T09:56:07.298Z
Status:       DRAFT
Locked?       false
Files:
  src
1 https://thedata.harvard.edu/dvn/api/data-deposit/v1/swordv2/edit-media/file/2349641/mydf.tab
                       type                  updated  fileId
1 text/tab-separated-values 2013-12-02T09:56:16.227Z 2349641
The above example output shows a variety of information, including the lack of a study author, the state of the study as DRAFT (meaning the current version is unreleased), and a list of the included files (we see the mydf object uploaded above). If satisfied, we can release the study:
dvReleaseStudy(s)
When a study is released, it becomes immediately accessible via its persistent Handle or DOI. If changes are made to the study (either metadata or files), these remain in draft form until the study is released again. Each release is persistently available unless the entire study collection is deleted.
Before releasing a study, we can delete it from The Dataverse Network server using dvDeleteStudy. Once a study is released, it cannot be deleted, but access to it can be revoked. Thus, calling dvDeleteStudy only deletes studies that are unreleased; calling dvDeleteStudy on a released study revokes public access, but a persistent page will remain at the study URL noting that a study was previously released. Similarly, it is possible to delete files using dvDeleteFile with an argument specifying the fileId (which we can extract from dvStudyStatement):
> d <- dvStudyStatement(s)
> d$files$fileId
[1] "2349642" "2349641"
> dvDeleteFile("2349642")
Operation appears to have succeeded.
[1] ""
If a file deletion is successful, dvn will report a success message and dvDeleteFile will return an empty character string.
We can also modify a study's metadata using dvEditStudy, supplied with a metadata listing (e.g., as returned by dvBuildMetadata):
> m2 <- dvBuildMetadata(title = 'An Actual Title', creator = 'Me', date = '2013')
> dvEditStudy(s, m2)
Citation:  Me, 2013, "An Actual Title", http://dx.doi.org/10.7910/DVN/ILAJO V1
ObjectId:  doi:10.7910/DVN/ILAJO
Study URI: https://thedata.harvard.edu/dvn/api/data-deposit/v1/swordv2/edit/study/doi:10.7910/DVN/ILAJO
Generated by http://www.swordapp.org/ 2.0
The response reflects the updated metadata. If changes are made to a study (files are added or deleted, or metadata is modified), those changes will be visible via dvn as a DRAFT study version, but will not be public. To make them public, simply release the study again with dvReleaseStudy. Both study versions will remain accessible via The Dataverse Network web browser interface.
Once we've created one or more studies (regardless of whether they are released), we can view them using dvUserStudies for a named dataverse or from a dvServiceDoc response:
dvUserStudies("mydataverse")
# or:
dvUserStudies(dvServiceDoc())
Thus, while we only need three core functions (dvCreateStudy, dvAddFile, and dvReleaseStudy) to complete a reproducible research workflow, the other functions supplied by dvn allow a great degree of control over the appearance and contents of a study.
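Putting the pieces together, the minimal deposit workflow described above can be sketched end to end (the file names here are hypothetical placeholders):

```r
# End-to-end sketch of the minimal deposit workflow using dvn.
# Credentials and file names are placeholders.
options(dvn.user = "username")
options(dvn.pwd = "password")
m <- dvBuildMetadata(title = "My Study")      # build Dublin Core metadata
s <- dvCreateStudy("mydataverse", m)          # create the (draft) study
dvAddFile(s, c("mydata.csv", "analysis.R"))   # attach data and code
dvReleaseStudy(s)                             # make the study public
```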
As Gandrud (2013b) has demonstrated, R is a powerful tool in the reproducible research workflow. Yet a missing element of that workflow to-date has been the easy ability to store research files in a persistent, public data repository designed for reproducible research. The Dataverse Network is a free, open-source service explicitly dedicated to permanently storing data, with implementations hosted by a variety of institutions (mostly research universities) offering additional choice over whom to trust with the persistent storage of study records. By automatically providing study records a unique URI, data stored using The Dataverse Network has a version-controlled and persistent identifier that can be cited by subsequent users of the data, continuing the reproducible workflow process beyond the phase of data production and initial analysis. With the popularity of The Dataverse Network increasing, it is a logical choice for archiving one’s reproducible data and analysis files. dvn thus provides R users with the tools necessary to quickly and easily integrate data archiving into the reproducible research workflow.
CRAN packages used: dvn, knitr, rfigshare, RCurl, XML, rdryad, OAIHarvester
CRAN Task Views: ReproducibleResearch, WebTechnologies
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Leeper, "Archiving Reproducible Research with R and Dataverse", The R Journal, 2014
BibTeX citation
@article{RJ-2014-015,
  author  = {Leeper, Thomas J.},
  title   = {Archiving Reproducible Research with R and Dataverse},
  journal = {The R Journal},
  year    = {2014},
  note    = {https://rjournal.github.io/},
  volume  = {6},
  issue   = {1},
  issn    = {2073-4859},
  pages   = {151-158}
}