I present the small R package digitize, designed to extract data from scatterplots with a simple method and suited to small datasets. I present an application of this method to the extraction of data from a graph whose source is not available.
The digitize package, which I present here, allows a user to load an image file of a scatterplot (with the help of the read.jpeg function of the ReadImages package) into the graphical window of R, and to use the locator function to calibrate and extract the data. Calibration is done by setting four reference points on the axes of the original graph, two for the \(x\) values and two for the \(y\) values. Using four points means that each calibration point can be placed directly on its axis, as \(y\) data are not taken into account for calibration of the \(x\) axis, and vice versa.
This is useful when working on data that are not available in digital form, e.g. when integrating old papers in meta-analyses. Several commercial or free software packages allow a user to extract data from a plot in image format, among which we can cite PlotDigitizer (http://plotdigitizer.sourceforge.net/) or the commercial package GraphClick (http://www.arizona-software.ch/graphclick/). While these programs are powerful and quite ergonomic, for some lightweight use, one may want to load the graph directly into R, and as a result get the data directly in R format. This paper presents a rapid digitization of a scatterplot and subsequent statistical analysis of the data. As an example, we will use the data presented by Jacques Monod in a seminal microbiology paper (Monod 1949).
The original paper presents the growth rate (in terms of divisions per hour) of the bacterium Escherichia coli in media of increasing glucose concentration. Such a hyperbolic relationship is best represented by the equation \[R=R_{K} \; \frac{C}{C_{1}+C},\] where \(R\) is the growth rate at a given concentration of nutrients \(C\), \(R_{K}\) is the maximal growth rate, \(C_{1}\) is the concentration of nutrients at which \(R=0.5 R_{K}\). In R, this function is written as
MonodGrowth <- function(params, M) {
  with(params, rK*(M/(M1+M)))
}
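As a quick sanity check of this definition, the sketch below evaluates the function with the parameter values reported by Monod (1949). The list name `paper` and the element names `rK` and `M1` follow the conventions used in this article; the test concentrations are chosen purely for illustration.

```r
# Monod growth model, as defined in the text
MonodGrowth <- function(params, M) {
  with(params, rK * (M / (M1 + M)))
}

# Parameter values reported by Monod (1949)
paper <- list(rK = 1.35, M1 = 22e-6)

# At M = M1 the growth rate is half of its maximum
half_rate <- MonodGrowth(paper, paper$M1)   # 0.675

# At a concentration far above M1 the rate approaches rK
near_max <- MonodGrowth(paper, 1000 * paper$M1)
```

Because the function is vectorized over `M`, it can be applied directly to a whole column of digitized concentrations.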
In order to characterize the growth parameters of a bacterial population, one can measure its growth rate in different concentrations of nutrients. Monod (1949) proposed that, in the measured population, \(R_{K} = 1.35\) and \(C_{1} =22 \times 10^{-6}\). By using the digitize package to extract the data from this paper, we will be able to get our own estimates for these parameters.
Values of \(R_{K}\) and \(C_{1}\) were estimated using a simple genetic algorithm, which minimizes the error function (sum of squared errors) defined by
MonodError <- function(params, M, y) {
  with(params,
       sum((MonodGrowth(params, M)-y)^2))
}
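The genetic algorithm used for the estimates in this article is not reproduced here. As a stand-in, the sketch below minimizes the same error function with the general-purpose optimizer `optim()` from base R, which is a different technique from the one used in the paper. The data are synthetic and noise-free, generated from arbitrary illustrative parameters (`true_params`); they are not the digitized points, and the names `objective`, `fit` and `estimate` are introduced only for this example.

```r
# Monod growth model and its sum-of-squares error, as defined in the text
MonodGrowth <- function(params, M) {
  with(params, rK * (M / (M1 + M)))
}
MonodError <- function(params, M, y) {
  with(params, sum((MonodGrowth(params, M) - y)^2))
}

# Synthetic, noise-free data for illustration only (NOT the digitized points)
true_params <- list(rK = 1.35, M1 = 2.2)
M <- seq(0.5, 8, by = 0.5)
y <- MonodGrowth(true_params, M)

# optim() works on a numeric vector, so wrap the list-based error function
objective <- function(p) MonodError(list(rK = p[1], M1 = p[2]), M, y)

# Nelder-Mead (the default method), started from a rough initial guess
fit <- optim(par = c(1, 1), fn = objective)
estimate <- list(rK = fit$par[1], M1 = fit$par[2])
```

With real digitized data, `M` and `y` would be replaced by the calibrated `x` and `y` columns, and the minimized sum of squares is available as `fit$value`.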
The first step when using the digitize package is to specify four points on the graph that will be used to calibrate the axes. They must be given in the following order: leftmost \(x\), rightmost \(x\), lower \(y\), upper \(y\). For the first two points, the \(y\) value is not important, and for the last two, the \(x\) value is not important. For this example, it is assumed that we set the first two points at \(x_{1}=1\) and \(x_{2}=8\), and the last two points at \(y_{1}=0.5\) and \(y_{2}=1\), simply by clicking in the graphical window at these positions (preferably on the axes). Note that it is not necessary to calibrate using the extremities of the axes.
Loading the figure and printing it in the current device for calibration is done by
cal <- ReadAndCal('monod.jpg')
Once the graph appears in the window, the user must input the four calibration points by clicking on them; they are then marked as blue crosses. The calibration values are stored in the cal object, which is a list with \(x\) and \(y\) values. The next step is to read the data, simply by pointing and clicking on the graph. This is done by calling the DigitData function, whose arguments specify the type of lines/points drawn.
growth <- DigitData(col = 'red')
When all the points have been identified, the digitization is stopped in the same way that one stops the locator function, i.e. either by right-clicking or pressing the Esc key (see ?locator). The outcome of this step is shown in Figure 1. For these data to be usable, the next step is to apply the calibration information to obtain the correct coordinates of the points. We can do this by using the function Calibrate(data,calpoints,x1,x2,y1,y2) and the true coordinates of the calibration points (x1, x2, y1 and y2 correspond, respectively, to the points \(x_{1}\), \(x_{2}\), \(y_{1}\) and \(y_{2}\)).
data <- Calibrate(growth, cal, 1, 8, 0.5, 1)
The internals of the function Calibrate are quite simple. If we consider \(X = \left(X_{1},X_{2}\right)\) a vector containing the \(x\) coordinates of the calibration points for the \(x\) axis on the graphic window, and \(x = \left(x_{1},x_{2}\right)\) a vector with their true values, it follows that \(x=aX+b\). As such, performing a simple linear regression of x against X allows us to determine the coefficients \(a\) and \(b\) needed to convert the data. The same procedure is repeated for the \(y\) axis. One advantage of this method of calibration is that you do not need to focus on the \(y\) value of the points used for \(x\) calibration, and vice versa. It means that in order to accurately calibrate a graph, you only need two \(x\) coordinates and two \(y\) coordinates. As a result, it is very simple to calibrate the graphic by setting the calibration points directly on the tick marks of their axis.
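The regression step described above can be sketched for a single axis as follows. This is not the source code of Calibrate; `calibrate_axis` is a hypothetical helper written for illustration, and the device coordinates (100 and 450) are made-up values standing in for the positions returned by two calibration clicks.

```r
# Hypothetical helper (not part of digitize): convert device coordinates
# along one axis to data coordinates, using two calibration clicks.
calibrate_axis <- function(clicked, true_values, device_coords) {
  fit <- lm(true_values ~ clicked)   # fits x = a * X + b
  coefs <- unname(coef(fit))         # coefs[1] is b, coefs[2] is a
  coefs[1] + coefs[2] * device_coords
}

# Made-up device coordinates of the two x calibration clicks,
# and the true values they correspond to on the axis
X <- c(100, 450)
x <- c(1, 8)

# Convert three device positions: both clicks and their midpoint
converted <- calibrate_axis(X, x, c(100, 275, 450))   # 1.0, 4.5, 8.0
```

With only two calibration points the regression line passes exactly through both, so the fit is equivalent to solving for \(a\) and \(b\) directly. The \(y\) axis would be handled by a second, independent call using the \(y\) coordinates of the last two clicks.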
The object returned by Calibrate is of class "data.frame", with columns "x" and "y" representing the \(x\) and \(y\) coordinates, so that we can plot it directly. The following code produces Figure 2, assuming that out is a list containing the arguments of MonodGrowth obtained after optimization (out$set), and paper is the same list with the values of Monod (1949).
plot(data$x, data$y, pch=20, col='grey',
     xlab = 'Nutrients concentration',
     ylab = 'Divisions per hour')
points(xcal, MonodGrowth(out$set, xcal),
       type = 'l', lty = 1, lwd = 2)
points(xcal, MonodGrowth(paper, xcal),
       type = 'l', lty = 2)
legend('bottomright',
       legend = c('data', 'best fit', 'paper value'),
       pch = c(20, NA, NA),
       lty = c(NA, 1, 2),
       lwd = c(NA, 2, 1),
       col = c('grey', 'black', 'black'),
       bty = 'n')
While using the MonodError function with the parameters proposed in the paper yields a sum of squares of \(1.32\), our estimated parameters reduce this value to \(0.45\), suggesting that the values of \(R_{K}\) and \(C_{1}\) presented in the paper are not optimal given the data.
I presented an example of using the digitize package to recover data from an old paper that are not available in digital form. While the principle of this package is very simple, it allows for quick extraction of data from a scatterplot (or any kind of planar display), which can be useful when data from old or proprietary figures are needed.
Thanks are due to two anonymous referees for their comments on an earlier version of this manuscript.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Poisot, "The digitize Package: Extracting Numerical Data from Scatterplots", The R Journal, 2011
BibTeX citation
@article{RJ-2011-004, author = {Poisot, Timothée}, title = {The digitize Package: Extracting Numerical Data from Scatterplots}, journal = {The R Journal}, year = {2011}, note = {https://rjournal.github.io/}, volume = {3}, issue = {1}, issn = {2073-4859}, pages = {25-26} }