I present the small R package digitize, designed to extract data from scatterplots with a simple method and suited to small datasets. I present an application of this method to the extraction of data from a graph whose source is not available.
The package digitize, which I present here, allows a user to load a graphical file of a scatterplot (with the help of the read.jpeg function of the ReadImages package) in the graphical window of R, and to use the locator function to calibrate and extract the data. Calibration is done by setting four reference points on the original graph axes, two for the x axis and two for the y axis.
This is useful when working on data that are not available in digital form, e.g. when integrating old papers in meta-analyses. Several commercial or free software packages allow a user to extract data from a plot in image format, among which we can cite PlotDigitizer (http://plotdigitizer.sourceforge.net/) or the commercial package GraphClick (http://www.arizona-software.ch/graphclick/). While these programs are powerful and quite ergonomic, for some lightweight use, one may want to load the graph directly into R, and as a result get the data directly in R format. This paper presents a rapid digitization of a scatterplot and subsequent statistical analysis of the data. As an example, we will use the data presented by Jacques Monod in a seminal microbiology paper (Monod 1949).
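The original paper presents the growth rate (in terms of divisions per
hour) of the bacterium Escherichia coli in media of increasing glucose
concentration. Such a hyperbolic relationship is best represented by the
equation
$$R = r_K \frac{M}{M_1 + M},$$
where $R$ is the growth rate, $M$ the nutrient concentration, $r_K$ the maximal growth rate and $M_1$ the concentration at which the growth rate is half of $r_K$ (the symbols follow the arguments of the MonodGrowth function).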
# Monod growth: rK is the maximal growth rate, M1 the half-saturation concentration
MonodGrowth <- function(params, M) {
  with(params, rK * (M / (M1 + M)))
}
In order to characterize the growth parameters of a bacterial
population, one can measure its growth rate in different concentrations
of nutrients. Monod (1949) proposed that, in the measured population,
Values of
# Sum of squared differences between predicted and observed growth rates
MonodError <- function(params, M, y) {
  sum((MonodGrowth(params, M) - y)^2)
}
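The text mentions parameters "obtained after optimization" without showing how they were obtained. As a minimal sketch, one way to minimize MonodError is the general-purpose optim function from the stats package. This choice is an assumption and not necessarily the author's method. The concentrations and growth rates below are synthetic, generated from arbitrary parameter values purely for illustration, and the names objective, fit and fitted are hypothetical.

```r
# Model and error functions, repeated here so the sketch is self-contained
MonodGrowth <- function(params, M) {
  with(params, rK * (M / (M1 + M)))
}
MonodError <- function(params, M, y) {
  sum((MonodGrowth(params, M) - y)^2)
}

# Synthetic, noise-free data (NOT the values of Monod 1949)
true_params <- list(rK = 1.2, M1 = 2)
M <- 1:8
y <- MonodGrowth(true_params, M)

# optim works on a numeric vector, so wrap the list-based error function
objective <- function(p) MonodError(list(rK = p[1], M1 = p[2]), M, y)

# Nelder-Mead search (the optim default) from an arbitrary starting point
fit <- optim(c(1, 1), objective)
fitted <- list(rK = fit$par[1], M1 = fit$par[2])
```

With real digitized data, M and y would be replaced by the calibrated x and y columns, and the starting values may need to be adjusted.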
The first step when using the digitize package is to specify four points on the graph that will be used to calibrate the axes. They must be in the following order: leftmost x, rightmost x, lowest y, highest y.
Loading the figure and printing it in the current device for calibration is done by
cal <- ReadAndCal('monod.jpg')
Once the graph appears in the window, the user must input (by clicking on them) the four calibration points, marked as blue crosses. The calibration values will be stored in the cal object, which is a list. The data points are then digitized with the DigitData function, whose arguments are the type of lines/points drawn.
growth <- DigitData(col = 'red')
When all the points have been identified, the digitization is stopped in the same way that one stops the locator function, i.e. either by right-clicking or pressing the Esc key (see ?locator). The outcome of this step is shown in Figure 1. The next step for these data to be exploitable is to use the calibration information to obtain the correct coordinates of the points. We can do this by passing the digitized data, the calibration points and their correct coordinates to Calibrate(data, calpoints, x1, x2, y1, y2), where x1, x2, y1 and y2 are, respectively, the true values of the four calibration points, in the order in which they were set.
data <- Calibrate(growth, cal, 1, 8, 0.5, 1)
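The internals of the function Calibrate are quite simple. If we consider the true values x of the two calibration points of the x axis and their coordinates X on the image, a linear fit of x against X allows us to determine the coefficients to convert the data. The same procedure is repeated for the y axis.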
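To make the conversion concrete, here is a minimal sketch of a linear calibration along one axis. It is not the source code of Calibrate: the helper calibrate_axis is hypothetical, and the pixel positions are made up for illustration.

```r
# Hypothetical helper (not part of digitize): convert image coordinates to
# data coordinates along one axis, using two reference points.
calibrate_axis <- function(image_pos, image_ref, true_ref) {
  # With only two reference points, the fitted line passes through both
  fit <- lm(true_ref ~ image_ref)
  unname(coef(fit)[1] + coef(fit)[2] * image_pos)
}

# Made-up pixel positions of the two x-axis reference points
image_ref <- c(100, 450)
# Their true values on the x axis
true_ref <- c(1, 8)

# Convert three image positions: both reference points and their midpoint
x_true <- calibrate_axis(c(100, 275, 450), image_ref, true_ref)
```

The two reference positions map back to 1 and 8, and the midpoint maps to 4.5, as expected from a linear transformation.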
The object returned by Calibrate is of class "data.frame", with columns "x" and "y" representing the calibrated coordinates of the points. In the following code, out is a list containing the arguments of MonodGrowth obtained after optimization (out$set), and paper is the same list with the values of Monod (1949).
# xcal (not defined in this excerpt) holds the concentrations at which
# the fitted curves are evaluated
plot(data$x, data$y, pch = 20, col = 'grey',
     xlab = 'Nutrients concentration',
     ylab = 'Divisions per hour')
points(xcal, MonodGrowth(out$set, xcal),
       type = 'l', lty = 1, lwd = 2)
points(xcal, MonodGrowth(paper, xcal),
       type = 'l', lty = 2)
legend('bottomright',
       legend = c('data', 'best fit', 'paper value'),
       pch = c(20, NA, NA),
       lty = c(NA, 1, 2),
       lwd = c(NA, 2, 1),
       col = c('grey', 'black', 'black'),
       bty = 'n')
While using the MonodError
function with the proposed parameters
yields a sum of squares of
I presented an example of using the digitize package to recover data from an old paper that are not available in digital form. While the principle of this package is very simple, it allows for a quick extraction of data from a scatterplot (or any kind of planar display), which can be useful when data from old or proprietary figures are needed.
Thanks are due to two anonymous referees for their comments on an earlier version of this manuscript.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Poisot, "The digitize Package: Extracting Numerical Data from Scatterplots", The R Journal, 2011
BibTeX citation
@article{RJ-2011-004,
  author = {Poisot, Timothée},
  title = {The digitize Package: Extracting Numerical Data from Scatterplots},
  journal = {The R Journal},
  year = {2011},
  note = {https://rjournal.github.io/},
  volume = {3},
  issue = {1},
  issn = {2073-4859},
  pages = {25-26}
}